Preface



From an applications viewpoint, the main reason to study the subject of these notes is to help 
deal with the complexity of describing random, time-varying functions. A random variable can 
be interpreted as the result of a single measurement. The distribution of a single random vari- 
able is fairly simple to describe. It is completely specified by the cumulative distribution function 
F(x), a function of one variable. It is relatively easy to approximately represent a cumulative 
distribution function on a computer. The joint distribution of several random variables is much 
more complex, for in general, it is described by a joint cumulative probability distribution function, 
$F(x_1, x_2, \ldots, x_n)$, which is much more complicated than $n$ functions of one variable. A random
process, for example a model of time-varying fading in a communication channel, involves many,
possibly infinitely many (one for each time instant $t$ within an observation interval) random variables.
Woe the complexity!

These notes help prepare the reader to understand and use the following methods for dealing 
with the complexity of random processes: 

• Work with moments, such as means and covariances. 

• Make extensive use of processes with special properties. Most notably, Gaussian processes are
characterized entirely by means and covariances, Markov processes are characterized by one-step
transition probabilities or transition rates, and initial distributions. Independent increment
processes are characterized by the distributions of single increments.

• Appeal to models or approximations based on limit theorems for reduced complexity descrip- 
tions, especially in connection with averages of independent, identically distributed random 
variables. The law of large numbers tells us that, in a certain context, a probability distribution
can be characterized by its mean alone. The central limit theorem similarly tells us
that a probability distribution can be characterized by its mean and variance. These limit
theorems are analogous to, and in fact examples of, perhaps the most powerful tool ever dis- 
covered for dealing with the complexity of functions: Taylor's theorem, in which a function 
in a small interval can be approximated using its value and a small number of derivatives at 
a single point. 

• Diagonalize. A change of coordinates reduces an arbitrary n-dimensional Gaussian vector
into a Gaussian vector with n independent coordinates. In the new coordinates the joint
probability distribution is the product of n one-dimensional distributions, representing a great
reduction of complexity. Similarly, a random process on an interval of time is diagonalized by
the Karhunen-Loeve representation. A periodic random process is diagonalized by a Fourier
series representation. Stationary random processes are diagonalized by Fourier transforms.

• Sample. A narrowband continuous time random process can be exactly represented by its 
samples taken with sampling rate twice the highest frequency of the random process. The 
samples offer a reduced complexity representation of the original process. 

• Work with baseband equivalent. The range of frequencies in a typical radio transmission 
is much smaller than the center frequency, or carrier frequency, of the transmission. The 
signal could be represented directly by sampling at twice the largest frequency component. 
However, the sampling frequency, and hence the complexity, can be dramatically reduced by 
sampling a baseband equivalent random process. 

These notes were written for the first semester graduate course on random processes, offered 
by the Department of Electrical and Computer Engineering at the University of Illinois at Urbana- 
Champaign. Students in the class are assumed to have had a previous course in probability, which 
is briefly reviewed in the first chapter of these notes. Students are also expected to have some 
familiarity with real analysis and elementary linear algebra, such as the notions of limits, definitions 
of derivatives, Riemann integration, and diagonalization of symmetric matrices. These topics are 
reviewed in the appendix. Finally, students are expected to have some familiarity with transform 
methods and complex analysis, though the concepts used are reviewed in the relevant chapters. 

Each chapter represents roughly two weeks of lectures, and includes homework problems. Solu- 
tions to the even numbered problems without stars can be found at the end of the notes. Students 
are encouraged to first read a chapter, then try doing the even numbered problems before looking 
at the solutions. Problems with stars, for the most part, investigate additional theoretical issues, 
and solutions are not provided. 

Hopefully some students reading these notes will find them useful for understanding the diverse
technical literature on systems engineering, ranging from control systems and image processing to
communication theory and communication network performance analysis. Hopefully some students
will go on to design systems, and define and analyze stochastic models. Hopefully others will 
be motivated to continue study in probability theory, going on to learn measure theory and its 
applications to probability and analysis in general. 

A brief comment is in order on the level of rigor and generality at which these notes are written. 
Engineers and scientists have great intuition and ingenuity, and routinely use methods that are 
not typically taught in undergraduate mathematics courses. For example, engineers generally have 
good experience and intuition about transforms, such as Fourier transforms, Fourier series, and 
z-transforms, and some associated methods of complex analysis. In addition, they routinely use
generalized functions, in particular the delta function. The use of these concepts
in these notes leverages this knowledge, and it is consistent with mathematical definitions,
but full mathematical justification is not given in every instance. The mathematical background 
required for a full mathematically rigorous treatment of the material in these notes is roughly at 
the level of a second year graduate course in measure theoretic probability, pursued after a course 
on measure theory.

Organization

The first four chapters of the notes are used heavily in the remaining chapters, so that most 
readers should cover those chapters before moving on. 

Chapter 1 is meant primarily as a review of concepts found in a typical first course on probability 
theory, with an emphasis on axioms and the definition of expectation. Readers desiring a 
more extensive review of basic probability are referred to the author's notes for ECE 313 at 
University of Illinois. 

Chapter 2 focuses on various ways in which a sequence of random variables can converge, and 
the basic limit theorems of probability: law of large numbers, central limit theorem, and the 
asymptotic behavior of large deviations. 

Chapter 3 focuses on minimum mean square error estimation and the orthogonality principle. 

Chapter 4 introduces the notion of random process, and briefly covers several examples and classes 
of random processes. Markov processes and martingales are introduced in this chapter, but 
are covered in greater depth in later chapters. 

The following four additional topics can be covered independently of each other. 

Chapter 5 describes the use of Markov processes for modeling and statistical inference. Applica- 
tions include natural language processing. 

Chapter 6 describes the use of Markov processes for modeling and analysis of dynamical systems. 
Applications include the modeling of queueing systems. 

Chapters 7-9 develop calculus for random processes based on mean square
convergence, moving to linear filtering, orthogonal expansions, and ending with causal and
noncausal Wiener filtering.

Chapter 10 explores martingales with respect to filtrations.

In recent one-semester course offerings, the author covered Chapters 1-5, Sections 6.1-6.8,
Chapter 7, Sections 8.1-8.4, and Section 9.1. Time did not permit covering the Foster-Lyapunov
stability criteria, noncausal Wiener filtering, and the chapter on martingales.

A number of background topics are covered in the appendix, including basic notation. 



Chapter 1 

Getting Started 



This chapter reviews many of the main concepts in a first level course on probability theory with 
more emphasis on axioms and the definition of expectation than is typical of a first course. 

1.1 The axioms of probability theory 

Random processes are widely used to model systems in engineering and scientific applications. 
These notes adopt the most widely used framework of probability and random processes, namely 
the one based on Kolmogorov's axioms of probability. The idea is to assume a mathematically solid 
definition of the model. This structure encourages a modeler to have a consistent, if not accurate, 
model. 

A probability space is a triplet $(\Omega, \mathcal{F}, P)$. The first component, $\Omega$, is a nonempty set. Each
element $\omega$ of $\Omega$ is called an outcome and $\Omega$ is called the sample space. The second component, $\mathcal{F}$,
is a set of subsets of $\Omega$ called events. The set of events $\mathcal{F}$ is assumed to be a $\sigma$-algebra, meaning it
satisfies the following axioms (see Appendix 11.1 for set notation):

A.1 $\Omega \in \mathcal{F}$

A.2 If $A \in \mathcal{F}$ then $A^c \in \mathcal{F}$

A.3 If $A, B \in \mathcal{F}$ then $A \cup B \in \mathcal{F}$. Also, if $A_1, A_2, \ldots$ is a sequence of elements in $\mathcal{F}$ then
$\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$.

If $\mathcal{F}$ is a $\sigma$-algebra and $A, B \in \mathcal{F}$, then $AB \in \mathcal{F}$ by A.2, A.3 and the fact $AB = (A^c \cup B^c)^c$. By
the same reasoning, if $A_1, A_2, \ldots$ is a sequence of elements in a $\sigma$-algebra $\mathcal{F}$, then $\bigcap_{i=1}^{\infty} A_i \in \mathcal{F}$.

Events $A_i$, $i \in I$, indexed by a set $I$ are called mutually exclusive if the intersection $A_i A_j = \emptyset$
for all $i, j \in I$ with $i \neq j$. The final component, $P$, of the triplet $(\Omega, \mathcal{F}, P)$ is a probability measure
on $\mathcal{F}$ satisfying the following axioms:

P.1 $P(A) \geq 0$ for all $A \in \mathcal{F}$

P.2 If $A, B \in \mathcal{F}$ and if $A$ and $B$ are mutually exclusive, then $P(A \cup B) = P(A) + P(B)$. Also,
if $A_1, A_2, \ldots$ is a sequence of mutually exclusive events in $\mathcal{F}$ then $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$.

P.3 $P(\Omega) = 1$.

The axioms imply a host of properties including the following. For any events $A, B, C$ in $\mathcal{F}$:

• If $A \subset B$ then $P(A) \leq P(B)$

• $P(A \cup B) = P(A) + P(B) - P(AB)$

• $P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(AB) - P(AC) - P(BC) + P(ABC)$

• $P(A) + P(A^c) = 1$

• $P(\emptyset) = 0$.
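The properties listed above can be checked mechanically on a small finite probability space. The following is a minimal sketch, not part of the original notes: it assumes a fair six-sided die with all subsets as events, and the helper names (`prob`, `all_events`, `check_properties`) are purely illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Assumed example space: one roll of a fair six-sided die.
omega = frozenset(range(1, 7))


def prob(event):
    """P(event) when all outcomes in omega are equally likely."""
    return Fraction(len(event & omega), len(omega))


def all_events(sample_space):
    """All subsets of a finite sample space (the largest sigma-algebra on it)."""
    items = sorted(sample_space)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def check_properties(a, b, c):
    """True if the listed consequences of the axioms hold for events a, b, c."""
    monotone = (not a <= b) or prob(a) <= prob(b)
    two_sets = prob(a | b) == prob(a) + prob(b) - prob(a & b)
    three_sets = prob(a | b | c) == (prob(a) + prob(b) + prob(c)
                                     - prob(a & b) - prob(a & c) - prob(b & c)
                                     + prob(a & b & c))
    complement = prob(a) + prob(omega - a) == 1
    return monotone and two_sets and three_sets and complement


events = all_events(omega)
```

Exact rational arithmetic (`Fraction`) is used so that the identities are tested as equalities rather than up to rounding error.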

Example 1.1.1 (Toss of a fair coin) Using "H" for "heads" and "T" for "tails," the toss of a fair
coin is modelled by

$\Omega = \{H, T\}$

$\mathcal{F} = \{\{H\}, \{T\}, \{H, T\}, \emptyset\}$

$P\{H\} = P\{T\} = \frac{1}{2}, \quad P\{H, T\} = 1, \quad P(\emptyset) = 0$

Note that, for brevity, we omitted the parentheses and wrote $P\{H\}$ instead of $P(\{H\})$.



Example 1.1.2 (Standard unit-interval probability space) Take $\Omega = \{\theta : 0 \leq \theta \leq 1\}$. Imagine an
experiment in which the outcome $\omega$ is drawn from $\Omega$ with no preference towards any subset. In
particular, we want the set of events $\mathcal{F}$ to include intervals, and the probability of an interval $[a, b]$
with $0 \leq a \leq b \leq 1$ to be given by:

$$P([a, b]) = b - a. \qquad (1.1)$$

Taking $a = b$, we see that $\mathcal{F}$ contains singleton sets $\{a\}$, and these sets have probability zero. Since
$\mathcal{F}$ is to be a $\sigma$-algebra, it must also contain all the open intervals $(a, b)$ in $\Omega$, and for such an open
interval, $P((a, b)) = b - a$. Any open subset of $\Omega$ is the union of a finite or countably infinite set
of open intervals, so that $\mathcal{F}$ should contain all open and all closed subsets of $\Omega$. Thus, $\mathcal{F}$ must
contain any set that is the intersection of countably many open sets, and so on.
The specification of the probability function $P$ must be extended from intervals to all of $\mathcal{F}$. It is
not a priori clear how large $\mathcal{F}$ can be. It is tempting to take $\mathcal{F}$ to be the set of all subsets of $\Omega$.
However, that idea does not work; see the starred homework showing that the length of all subsets
of $\mathbb{R}$ can't be defined in a consistent way. The problem is resolved by taking $\mathcal{F}$ to be the smallest
$\sigma$-algebra containing all the subintervals of $\Omega$, or equivalently, containing all the open subsets of $\Omega$.
This $\sigma$-algebra is called the Borel $\sigma$-algebra for $[0, 1]$, and the sets in it are called Borel sets. While
not every subset of $\Omega$ is a Borel subset, any set we are likely to encounter in applications is a Borel
set. The existence of the Borel $\sigma$-algebra is discussed in an extra credit problem. Furthermore,
extension theorems of measure theory¹ imply that $P$ can be extended from its definition (1.1) for
interval sets to all Borel sets.

The smallest $\sigma$-algebra, $\mathcal{B}$, containing the open subsets of $\mathbb{R}$ is called the Borel $\sigma$-algebra for $\mathbb{R}$,
and the sets in it are called Borel sets. Similarly, the Borel $\sigma$-algebra $\mathcal{B}^n$ of subsets of $\mathbb{R}^n$ is the
smallest $\sigma$-algebra containing all sets of the form $[a_1, b_1] \times [a_2, b_2] \times \cdots \times [a_n, b_n]$. Sets in $\mathcal{B}^n$ are
called Borel subsets of $\mathbb{R}^n$. The class of Borel sets includes not only rectangle sets and countable
unions of rectangle sets, but all open sets and all closed sets. Virtually any subset of $\mathbb{R}^n$ arising in
applications is a Borel set.

Example 1.1.3 (Repeated binary trials) Suppose we would like to represent an infinite sequence
of binary observations, where each observation is a zero or one with equal probability. For example,
the experiment could consist of repeatedly flipping a fair coin, and recording a one each time it
shows heads and a zero each time it shows tails. Then an outcome $\omega$ would be an infinite sequence,
$\omega = (\omega_1, \omega_2, \ldots)$, such that for each $i \geq 1$, $\omega_i \in \{0, 1\}$. Let $\Omega$ be the set of all such $\omega$'s. The set of
events can be taken to be large enough so that any set that can be defined in terms of only finitely
many of the observations is an event. In particular, for any binary sequence $(b_1, \ldots, b_n)$ of some
finite length $n$, the set $\{\omega \in \Omega : \omega_i = b_i \text{ for } 1 \leq i \leq n\}$ should be in $\mathcal{F}$, and the probability of such
a set is taken to be $2^{-n}$.

There are also events that don't depend on a fixed, finite number of observations. For example,
let $F$ be the event that an even number of observations is needed until a one is observed. Show
that $F$ is an event and then find its probability.

Solution: For $k \geq 1$, let $E_k$ be the event that the first one occurs on the $k$th observation. So
$E_k = \{\omega : \omega_1 = \omega_2 = \cdots = \omega_{k-1} = 0 \text{ and } \omega_k = 1\}$. Then $E_k$ depends on only a finite number of
observations, so it is an event, and $P(E_k) = 2^{-k}$. Observe that $F = E_2 \cup E_4 \cup E_6 \cup \cdots$, so $F$ is an
event by Axiom A.3. Also, the events $E_2, E_4, \ldots$ are mutually exclusive, so by the full version of
Axiom P.2:

$$P(F) = P(E_2) + P(E_4) + \cdots = \frac{1}{4}\left(1 + \frac{1}{4} + \left(\frac{1}{4}\right)^2 + \cdots\right) = \frac{1/4}{1 - (1/4)} = \frac{1}{3}.$$
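The value $P(F) = 1/3$ can be sanity-checked numerically. The sketch below is an illustration under stated assumptions rather than part of the notes: `partial_sum` adds the first terms of the series exactly, and `estimate_p_even` simulates the experiment with Python's pseudo-random generator (the function names and the fixed seed are hypothetical choices).

```python
from fractions import Fraction
import random


def partial_sum(num_terms):
    """Exact P(E_2) + P(E_4) + ... + P(E_{2*num_terms}), using P(E_k) = 2**-k."""
    return sum(Fraction(1, 2 ** (2 * j)) for j in range(1, num_terms + 1))


def first_one_index(rng):
    """Simulate fair binary observations; return the index of the first one."""
    k = 1
    while rng.random() < 0.5:  # this branch is treated as observing a zero
        k += 1
    return k


def estimate_p_even(num_trials, seed=0):
    """Monte Carlo estimate of P(F): fraction of trials whose first one is at an even index."""
    rng = random.Random(seed)
    hits = sum(first_one_index(rng) % 2 == 0 for _ in range(num_trials))
    return hits / num_trials
```

The partial sums equal $\frac{1}{3}(1 - 4^{-m})$ after $m$ terms, so they increase to $1/3$, in line with the countable additivity used in the solution.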



¹ See, for example, H.L. Royden, Real Analysis, Third edition. Macmillan, New York, 1988, or S.R.S. Varadhan,
Probability Theory Lecture Notes, American Mathematical Society, 2001. The $\sigma$-algebra $\mathcal{F}$ and $P$ can be extended
somewhat further by requiring the following completeness property: if $B \subset A \in \mathcal{F}$ with $P(A) = 0$, then $B \in \mathcal{F}$ (and
also $P(B) = 0$).

The following lemma gives a continuity property of probability measures which is analogous to
continuity of functions on $\mathbb{R}^n$, reviewed in Appendix 11.3. If $B_1, B_2, \ldots$ is a sequence of events such
that $B_1 \subset B_2 \subset B_3 \subset \cdots$, then we can think that $B_j$ converges to the set $\bigcup_{i=1}^{\infty} B_i$ as $j \to \infty$. The
lemma states that in this case, $P(B_j)$ converges to the probability of the limit set as $j \to \infty$.

Lemma 1.1.4 (Continuity of Probability) Suppose $B_1, B_2, \ldots$ is a sequence of events.

(a) If $B_1 \subset B_2 \subset \cdots$ then $\lim_{j \to \infty} P(B_j) = P\left(\bigcup_{i=1}^{\infty} B_i\right)$

(b) If $B_1 \supset B_2 \supset \cdots$ then $\lim_{j \to \infty} P(B_j) = P\left(\bigcap_{i=1}^{\infty} B_i\right)$



Proof Suppose $B_1 \subset B_2 \subset \cdots$. Let $D_1 = B_1$, $D_2 = B_2 - B_1$, and, in general, let $D_i = B_i - B_{i-1}$
for $i \geq 2$, as shown in Figure 1.1. Then $P(B_j) = \sum_{i=1}^{j} P(D_i)$ for each $j \geq 1$, so

Figure 1.1: A sequence of nested sets.

$$\lim_{j \to \infty} P(B_j) = \lim_{j \to \infty} \sum_{i=1}^{j} P(D_i) \overset{(a)}{=} \sum_{i=1}^{\infty} P(D_i) \overset{(b)}{=} P\left(\bigcup_{i=1}^{\infty} D_i\right) = P\left(\bigcup_{i=1}^{\infty} B_i\right)$$

where (a) is true by the definition of the sum of an infinite series, and (b) is true by axiom P.2. This
proves Lemma 1.1.4(a). Lemma 1.1.4(b) can be proved similarly, or can be derived by applying
Lemma 1.1.4(a) to the sets $B_j^c$.



Example 1.1.5 (Selection of a point in a square) Take $\Omega$ to be the square region in the plane,

$$\Omega = \{(x, y) : 0 \leq x, y \leq 1\}.$$

Let $\mathcal{F}$ be the Borel $\sigma$-algebra for $\Omega$, which is the smallest $\sigma$-algebra containing all the rectangular
subsets of $\Omega$ that are aligned with the axes. Take $P$ so that for any rectangle $R$,

$$P(R) = \text{area of } R.$$

(It can be shown that $\mathcal{F}$ and $P$ exist.) Let $T$ be the triangular region $T = \{(x, y) : 0 \leq y \leq x \leq 1\}$.
Since $T$ is not rectangular, it is not immediately clear that $T \in \mathcal{F}$, nor is it clear what $P(T)$ is.
That is where the axioms come in. For $n \geq 1$, let $T_n$ denote the region shown in Figure 1.2. Since
$T_n$ can be written as a union of finitely many mutually exclusive rectangles, it follows that $T_n \in \mathcal{F}$
and it is easily seen that $P(T_n) = \frac{1 + 2 + \cdots + n}{n^2} = \frac{n+1}{2n}$. Since $T_1 \supset T_2 \supset T_4 \supset T_8 \cdots$ and $\bigcap_j T_{2^j} = T$, it
follows that $T \in \mathcal{F}$ and $P(T) = \lim_{n \to \infty} P(T_n) = \frac{1}{2}$.

Figure 1.2: Approximation of a triangular region.

The reader is encouraged to show that if $C$ is the diameter one disk inscribed within $\Omega$ then
$P(C) = (\text{area of } C) = \frac{\pi}{4}$.
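The limit in Example 1.1.5 can be reproduced with exact arithmetic. Since Figure 1.2 is not available here, the sketch below assumes that $T_n$ is a staircase made of $n$ columns of width $1/n$, with column $i$ having height $i/n$; this is consistent with the stated value $P(T_n) = (1 + 2 + \cdots + n)/n^2$ but is an assumption about the figure. The function name is illustrative.

```python
from fractions import Fraction


def staircase_probability(n):
    """P(T_n), assuming T_n consists of n columns of width 1/n, column i of height i/n."""
    return sum(Fraction(1, n) * Fraction(i, n) for i in range(1, n + 1))


# Probabilities along the nested subsequence T_1, T_2, T_4, T_8, ...
nested_values = [staircase_probability(2 ** j) for j in range(0, 8)]
```

Along the nested subsequence the values decrease toward $1/2$, which is what continuity of probability (Lemma 1.1.4(b)) predicts for $P(T)$.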



1.2 Independence and conditional probability 

Events $A_1$ and $A_2$ are defined to be independent if $P(A_1 A_2) = P(A_1)P(A_2)$. More generally, events
$A_1, A_2, \ldots, A_k$ are defined to be independent if

$$P(A_{i_1} A_{i_2} \cdots A_{i_j}) = P(A_{i_1})P(A_{i_2}) \cdots P(A_{i_j})$$

whenever $j$ and $i_1, i_2, \ldots, i_j$ are integers with $j \geq 1$ and $1 \leq i_1 < i_2 < \cdots < i_j \leq k$.

For example, events $A_1, A_2, A_3$ are independent if the following four conditions hold:

$P(A_1 A_2) = P(A_1)P(A_2)$

$P(A_1 A_3) = P(A_1)P(A_3)$

$P(A_2 A_3) = P(A_2)P(A_3)$

$P(A_1 A_2 A_3) = P(A_1)P(A_2)P(A_3)$

A weaker condition is sometimes useful: Events $A_1, \ldots, A_k$ are defined to be pairwise independent
if $A_i$ is independent of $A_j$ whenever $1 \leq i < j \leq k$. Independence of $k$ events requires
that $2^k - k - 1$ equations hold: one for each subset of $\{1, 2, \ldots, k\}$ of size at least two. Pairwise
independence only requires that $\binom{k}{2} = \frac{k(k-1)}{2}$ equations hold.
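To see that pairwise independence really is weaker, the following sketch checks a standard example that is not taken from the notes: two tosses of a fair coin, with $A_1$ = first toss heads, $A_2$ = second toss heads, and $A_3$ = both tosses agree. All names are illustrative.

```python
from fractions import Fraction
from itertools import combinations, product

# Assumed example space: two tosses of a fair coin, four equally likely outcomes.
omega = list(product("HT", repeat=2))


def prob(event):
    """P(event) for equally likely outcomes."""
    return Fraction(sum(1 for w in omega if w in event), len(omega))


a1 = {w for w in omega if w[0] == "H"}   # first toss is heads
a2 = {w for w in omega if w[1] == "H"}   # second toss is heads
a3 = {w for w in omega if w[0] == w[1]}  # the two tosses agree
events = [a1, a2, a3]


def product_rule_holds(subset):
    """Check P(intersection of the events) == product of their probabilities."""
    intersection = set(omega)
    expected = Fraction(1)
    for e in subset:
        intersection &= e
        expected *= prob(e)
    return prob(intersection) == expected


def pairwise_independent(evts):
    return all(product_rule_holds(pair) for pair in combinations(evts, 2))


def independent(evts):
    """Check one equation for every subset of size at least two."""
    return all(product_rule_holds(sub)
               for size in range(2, len(evts) + 1)
               for sub in combinations(evts, size))


def num_equations(k):
    """Number of product equations required for independence of k events."""
    return 2 ** k - k - 1
```

Here the three pairwise equations hold, but $P(A_1 A_2 A_3) = 1/4 \neq 1/8$, so the fourth condition fails.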

If $A$ and $B$ are events and $P(B) \neq 0$, then the conditional probability of $A$ given $B$ is defined by

$$P(A \mid B) = \frac{P(AB)}{P(B)}.$$

It is not defined if $P(B) = 0$, which has the following meaning. If you were to write a computer
routine to compute $P(A \mid B)$ and the inputs are $P(AB) = 0$ and $P(B) = 0$, your routine shouldn't
simply return the value 0. Rather, your routine should generate an error message such as "input
error-conditioning on event of probability zero." Such an error message would help you or others
find errors in larger computer programs which use the routine.
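A routine of the kind described above might look as follows. This is only a sketch: the function name, the argument names, and the extra consistency check on the inputs are assumptions made for illustration.

```python
def conditional_probability(p_ab, p_b):
    """Return P(A | B) = P(AB) / P(B), refusing to condition on a null event."""
    # Added sanity check (an assumption, not from the text): AB is a subset of B.
    if not (0.0 <= p_ab <= p_b <= 1.0):
        raise ValueError("input error: need 0 <= P(AB) <= P(B) <= 1")
    if p_b == 0:
        raise ValueError("input error: conditioning on event of probability zero")
    return p_ab / p_b
```

Raising an exception, rather than returning 0, makes the misuse visible to the calling program at the point where it occurs.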

As a function of $A$ for $B$ fixed with $P(B) \neq 0$, the conditional probability of $A$ given $B$ is itself
a probability measure for $\Omega$ and $\mathcal{F}$. More explicitly, fix $B$ with $P(B) \neq 0$. For each event $A$ define
$P'(A) = P(A \mid B)$. Then $(\Omega, \mathcal{F}, P')$ is a probability space, because $P'$ satisfies the axioms P.1-P.3.
(Try showing that.)

If $A$ and $B$ are independent then $A^c$ and $B$ are independent. Indeed, if $A$ and $B$ are independent
then

$$P(A^c B) = P(B) - P(AB) = (1 - P(A))P(B) = P(A^c)P(B).$$

Similarly, if $A$, $B$, and $C$ are independent events then $AB$ is independent of $C$. More generally,
suppose $E_1, E_2, \ldots, E_n$ are independent events, suppose $n = n_1 + \cdots + n_k$ with $n_i \geq 1$ for each $i$, and
suppose $F_1$ is defined by Boolean operations (intersections, complements, and unions) of the first $n_1$
events $E_1, \ldots, E_{n_1}$, $F_2$ is defined by Boolean operations on the next $n_2$ events, $E_{n_1+1}, \ldots, E_{n_1+n_2}$,
and so on. Then $F_1, \ldots, F_k$ are independent.

Events $E_1, \ldots, E_k$ are said to form a partition of $\Omega$ if the events are mutually exclusive and
$\Omega = E_1 \cup \cdots \cup E_k$. Of course for a partition, $P(E_1) + \cdots + P(E_k) = 1$. More generally, for any
event $A$, the law of total probability holds because $A$ is the union of the mutually exclusive sets
$AE_1, AE_2, \ldots, AE_k$:

$$P(A) = P(AE_1) + \cdots + P(AE_k).$$

If $P(E_i) \neq 0$ for each $i$, this can be written as

$$P(A) = P(A \mid E_1)P(E_1) + \cdots + P(A \mid E_k)P(E_k).$$

Figure 1.3 illustrates the condition of the law of total probability.

Figure 1.3: Partitioning a set A using a partition of $\Omega$.

Judicious use of the definition of conditional probability and the law of total probability leads
to Bayes' formula for $P(E_i \mid A)$ (if $P(A) \neq 0$) in simple form

$$P(E_i \mid A) = \frac{P(AE_i)}{P(A)} = \frac{P(A \mid E_i)P(E_i)}{P(A)},$$

or in expanded form:

$$P(E_i \mid A) = \frac{P(A \mid E_i)P(E_i)}{P(A \mid E_1)P(E_1) + \cdots + P(A \mid E_k)P(E_k)}.$$
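The law of total probability and the expanded form of Bayes' formula translate directly into a few lines of code. The sketch below uses made-up numbers for a three-set partition; the function names and the values in `priors` and `likelihoods` are hypothetical and serve only to illustrate the formulas.

```python
from fractions import Fraction


def total_probability(priors, likelihoods):
    """P(A) = sum_i P(A | E_i) P(E_i) for a partition E_1, ..., E_k."""
    return sum(l * p for l, p in zip(likelihoods, priors))


def bayes_posteriors(priors, likelihoods):
    """Return [P(E_i | A)] using the expanded form of Bayes' formula."""
    if sum(priors) != 1:
        raise ValueError("priors must sum to one for a partition")
    p_a = total_probability(priors, likelihoods)
    if p_a == 0:
        raise ValueError("input error: conditioning on event of probability zero")
    return [l * p / p_a for l, p in zip(likelihoods, priors)]


# Hypothetical inputs: P(E_i) and P(A | E_i) for i = 1, 2, 3.
priors = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
likelihoods = [Fraction(1, 10), Fraction(1, 5), Fraction(1, 2)]
posteriors = bayes_posteriors(priors, likelihoods)
```

With these inputs $P(A) = 1/20 + 1/15 + 1/12 = 1/5$, and the posteriors sum to one, as they must since $P(\cdot \mid A)$ is itself a probability measure.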

The remainder of this section gives the Borel-Cantelli lemma. It is a simple result based on
continuity of probability and independence of events, but it is not typically encountered in a first
course on probability. Let $(A_n : n \geq 0)$ be a sequence of events for a probability space $(\Omega, \mathcal{F}, P)$.

Definition 1.2.1 The event $\{A_n \text{ infinitely often}\}$ is the set of $\omega \in \Omega$ such that $\omega \in A_n$ for infinitely
many values of $n$.

Another way to describe $\{A_n \text{ infinitely often}\}$ is that it is the set of $\omega$ such that for any $k$, there is
an $n \geq k$ such that $\omega \in A_n$. Therefore,

$$\{A_n \text{ infinitely often}\} = \bigcap_{k \geq 1} \left(\bigcup_{n \geq k} A_n\right).$$

For each $k$, the set $\bigcup_{n \geq k} A_n$ is a countable union of events, so it is an event, and $\{A_n \text{ infinitely often}\}$
is an intersection of countably many such events, so that $\{A_n \text{ infinitely often}\}$ is also an event.

Lemma 1.2.2 (Borel-Cantelli lemma) Let $(A_n : n \geq 1)$ be a sequence of events and let $p_n = P(A_n)$.

(a) If $\sum_{n=1}^{\infty} p_n < \infty$, then $P\{A_n \text{ infinitely often}\} = 0$.

(b) If $\sum_{n=1}^{\infty} p_n = \infty$ and $A_1, A_2, \ldots$ are mutually independent, then $P\{A_n \text{ infinitely often}\} = 1$.

Proof. (a) Since $\{A_n \text{ infinitely often}\}$ is the intersection of the monotonically nonincreasing sequence
of events $\bigcup_{n \geq k} A_n$, it follows from the continuity of probability for monotone sequences of
events (Lemma 1.1.4) that $P\{A_n \text{ infinitely often}\} = \lim_{k \to \infty} P\left(\bigcup_{n \geq k} A_n\right)$. Lemma 1.1.4, the fact
that the probability of a union of events is less than or equal to the sum of the probabilities of the
events, and the definition of the sum of a sequence of numbers, yield that for any $k \geq 1$,

$$P\left(\bigcup_{n \geq k} A_n\right) = \lim_{m \to \infty} P\left(\bigcup_{n=k}^{m} A_n\right) \leq \lim_{m \to \infty} \sum_{n=k}^{m} p_n = \sum_{n=k}^{\infty} p_n.$$

Combining the above yields $P\{A_n \text{ infinitely often}\} \leq \lim_{k \to \infty} \sum_{n=k}^{\infty} p_n$. If $\sum_{n=1}^{\infty} p_n < \infty$, then
$\lim_{k \to \infty} \sum_{n=k}^{\infty} p_n = 0$, which implies part (a) of the lemma.

(b) Suppose that $\sum_{n=1}^{\infty} p_n = +\infty$ and that the events $A_1, A_2, \ldots$ are mutually independent. For
any $k \geq 1$, using the fact $1 - u \leq \exp(-u)$ for all $u$,

$$P\left(\bigcup_{n \geq k} A_n\right) = \lim_{m \to \infty} P\left(\bigcup_{n=k}^{m} A_n\right) = \lim_{m \to \infty} \left[1 - \prod_{n=k}^{m} (1 - p_n)\right]$$

$$\geq \lim_{m \to \infty} \left[1 - \exp\left(-\sum_{n=k}^{m} p_n\right)\right] = 1 - \exp\left(-\sum_{n=k}^{\infty} p_n\right) = 1 - \exp(-\infty) = 1.$$

Therefore, $P\{A_n \text{ infinitely often}\} = \lim_{k \to \infty} P\left(\bigcup_{n \geq k} A_n\right) = 1$. ∎

Example 1.2.3 Consider independent coin tosses using biased coins, such that $P(A_n) = p_n = \frac{1}{n}$,
where $A_n$ is the event of getting heads on the $n$th toss. Since $\sum_{n=1}^{\infty} \frac{1}{n} = +\infty$, the part of the
Borel-Cantelli lemma for independent events implies that $P\{A_n \text{ infinitely often}\} = 1$.

Example 1.2.4 Let $(\Omega, \mathcal{F}, P)$ be the standard unit-interval probability space defined in Example
1.1.2, and let $A_n = [0, \frac{1}{n})$. Then $p_n = \frac{1}{n}$ and $A_{n+1} \subset A_n$ for $n \geq 1$. The events are not independent,
because for $m < n$, $P(A_m A_n) = P(A_n) = \frac{1}{n} \neq P(A_m)P(A_n)$. Of course $0 \in A_n$ for all $n$. But
for any $\omega \in (0, 1]$, $\omega \notin A_n$ for $n > \frac{1}{\omega}$. Therefore, $\{A_n \text{ infinitely often}\} = \{0\}$. The single point
set $\{0\}$ has probability zero, so $P\{A_n \text{ infinitely often}\} = 0$. This conclusion holds even though
$\sum_{n=1}^{\infty} p_n = +\infty$, illustrating the need for the independence assumption in Lemma 1.2.2(b).
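The two quantities that drive the proof of the Borel-Cantelli lemma can be computed exactly for simple choices of $p_n$. The sketch below is illustrative only: it evaluates $1 - \prod_{n=k}^{m}(1 - p_n)$ for independent events with $p_n = 1/n$ (as in Example 1.2.3), and the union bound $\sum_{n=k}^{m} p_n$ for the summable choice $p_n = 1/n^2$, which is an added example not found in the notes.

```python
from fractions import Fraction


def union_prob_independent(p, k, m):
    """P(A_k U ... U A_m) for independent events with P(A_n) = p(n)."""
    none_occur = Fraction(1)
    for n in range(k, m + 1):
        none_occur *= 1 - p(n)
    return 1 - none_occur


def union_bound(p, k, m):
    """Upper bound sum_{n=k}^{m} p(n) on P(A_k U ... U A_m); no independence needed."""
    return sum(p(n) for n in range(k, m + 1))


def harmonic(n):
    """p_n = 1/n: the probabilities are not summable."""
    return Fraction(1, n)


def inverse_square(n):
    """p_n = 1/n**2: the probabilities are summable (added example)."""
    return Fraction(1, n * n)
```

For $p_n = 1/n$ and $k \geq 2$ the product telescopes to $(k-1)/m$, so the union probability is $1 - (k-1)/m$, which tends to 1 as $m \to \infty$ for every fixed $k$, matching part (b). For $p_n = 1/n^2$ the tail sums are below $1/(k-1)$ and tend to 0 as $k \to \infty$, matching part (a).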



1.3 Random variables and their distribution 

Let a probability space $(\Omega, \mathcal{F}, P)$ be given. By definition, a random variable is a function $X$ from
$\Omega$ to the real line $\mathbb{R}$ that is $\mathcal{F}$ measurable, meaning that for any number $c$,

$$\{\omega : X(\omega) \leq c\} \in \mathcal{F}. \qquad (1.2)$$

If $\Omega$ is finite or countably infinite, then $\mathcal{F}$ can be the set of all subsets of $\Omega$, in which case any
real-valued function on $\Omega$ is a random variable.

If $(\Omega, \mathcal{F}, P)$ is given as in the uniform phase example with $\mathcal{F}$ equal to the Borel subsets of $[0, 2\pi]$,
then the random variables on $(\Omega, \mathcal{F}, P)$ are called the Borel measurable functions on $\Omega$. Since the
Borel $\sigma$-algebra contains all subsets of $[0, 2\pi]$ that come up in applications, for practical purposes
we can think of any function on $[0, 2\pi]$ as being a random variable. For example, any piecewise
continuous or piecewise monotone function on $[0, 2\pi]$ is a random variable for the uniform phase
example.

The cumulative distribution function (CDF) of a random variable $X$ is denoted by $F_X$. It is
the function, with domain the real line $\mathbb{R}$, defined by

$$F_X(c) = P\{\omega : X(\omega) \leq c\} \qquad (1.3)$$

$$= P\{X \leq c\} \quad \text{(for short)} \qquad (1.4)$$

If $X$ denotes the outcome of the roll of a fair die ("die" is singular of "dice") and if $Y$ is uniformly
distributed on the interval $[0, 1]$, then $F_X$ and $F_Y$ are shown in Figure 1.4.

Figure 1.4: Examples of CDFs.

The CDF of a random variable $X$ determines $P\{X \leq c\}$ for any real number $c$. But what about
$P\{X < c\}$ and $P\{X = c\}$? Let $c_1, c_2, \ldots$ be a monotone nondecreasing sequence that converges to
$c$ from the left. This means $c_i \leq c_j < c$ for $i < j$ and $\lim_{j \to \infty} c_j = c$. Then the events $\{X \leq c_j\}$
are nested: $\{X \leq c_i\} \subset \{X \leq c_j\}$ for $i < j$, and the union of all such events is the event $\{X < c\}$.
Thus, by Lemma 1.1.4

$$P\{X < c\} = \lim_{i \to \infty} P\{X \leq c_i\} = \lim_{i \to \infty} F_X(c_i) = F_X(c-).$$

Therefore, $P\{X = c\} = F_X(c) - F_X(c-) = \Delta F_X(c)$, where $\Delta F_X(c)$ is defined to be the size of
the jump of $F_X$ at $c$. For example, if $X$ has the CDF shown in Figure 1.5 then $P\{X = 0\} = \frac{1}{2}$.
The requirement that $F_X$ be right continuous implies that for any number $c$ (such as $c = 0$ for this
example), if the value $F_X(c)$ is changed to any other value, the resulting function would no longer
be a valid CDF.

The collection of all events $A$ such that $P\{X \in A\}$ is determined by $F_X$ is a $\sigma$-algebra containing
the intervals, and thus this collection contains all Borel sets. That is, $P\{X \in A\}$ is determined by
$F_X$ for any Borel set $A$.

Figure 1.5: An example of a CDF.

Proposition 1.3.1 A function $F$ is the CDF of some random variable if and only if it has the
following three properties:

F.1 $F$ is nondecreasing

F.2 $\lim_{x \to +\infty} F(x) = 1$ and $\lim_{x \to -\infty} F(x) = 0$

F.3 $F$ is right continuous

Proof The "only if" part is proved first. Suppose that $F$ is the CDF of some random variable $X$.
Then if $x < y$, $F(y) = P\{X \leq y\} = P\{X \leq x\} + P\{x < X \leq y\} \geq P\{X \leq x\} = F(x)$ so that F.1
is true. Consider the events $B_n = \{X \leq n\}$. Then $B_n \subset B_m$ for $n \leq m$. Thus, by Lemma 1.1.4,

$$\lim_{n \to \infty} F(n) = \lim_{n \to \infty} P(B_n) = P\left(\bigcup_{n=1}^{\infty} B_n\right) = P(\Omega) = 1.$$

This and the fact $F$ is nondecreasing imply the following. Given any $\epsilon > 0$, there exists $N_\epsilon$ so large
that $F(x) \geq 1 - \epsilon$ for all $x \geq N_\epsilon$. That is, $F(x) \to 1$ as $x \to +\infty$. Similarly,

$$\lim_{n \to -\infty} F(n) = \lim_{n \to \infty} P(B_{-n}) = P\left(\bigcap_{n=1}^{\infty} B_{-n}\right) = P(\emptyset) = 0,$$

so that $F(x) \to 0$ as $x \to -\infty$. Property F.2 is proved.

The proof of F.3 is similar. Fix an arbitrary real number $x$. Define the sequence of events $A_n$
for $n \geq 1$ by $A_n = \{X \leq x + \frac{1}{n}\}$. Then $A_n \subset A_m$ for $n \geq m$ so

$$\lim_{n \to \infty} F\left(x + \frac{1}{n}\right) = \lim_{n \to \infty} P(A_n) = P\left(\bigcap_{k=1}^{\infty} A_k\right) = P\{X \leq x\} = F_X(x).$$

Convergence along the sequence $x + \frac{1}{n}$, together with the fact that $F$ is nondecreasing, implies that
$F(x+) = F(x)$. Property F.3 is thus proved. The proof of the "only if" portion of Proposition
1.3.1 is complete.

To prove the "if" part of Proposition 1.3.1, let $F$ be a function satisfying properties F.1-F.3. It
must be shown that there exists a random variable with CDF $F$. Let $\Omega = \mathbb{R}$ and let $\mathcal{F}$ be the set
$\mathcal{B}$ of Borel subsets of $\mathbb{R}$. Define $P$ on intervals of the form $(a, b]$ by $P((a, b]) = F(b) - F(a)$. It can
be shown by an extension theorem of measure theory that $P$ can be extended to all of $\mathcal{F}$ so that
the axioms of probability are satisfied. Finally, let $X(\omega) = \omega$ for all $\omega \in \Omega$.

Then

$$P(X \in (a, b]) = P((a, b]) = F(b) - F(a).$$

Therefore, $X$ has CDF $F$.

The vast majority of random variables described in applications are one of two types, to be
described next. A random variable $X$ is a discrete random variable if there is a finite or countably
infinite set of values $\{x_i : i \in I\}$ such that $P\{X \in \{x_i : i \in I\}\} = 1$. The probability mass function
(pmf) of a discrete random variable $X$, denoted $p_X(x)$, is defined by $p_X(x) = P\{X = x\}$. Typically
the pmf of a discrete random variable is much more useful than the CDF. However, the pmf and
CDF of a discrete random variable are related by $p_X(x) = \Delta F_X(x)$ and conversely,

$$F_X(x) = \sum_{y : y \leq x} p_X(y), \qquad (1.5)$$

where the sum in (1.5) is taken only over $y$ such that $p_X(y) \neq 0$. If $X$ is a discrete random variable
with only finitely many mass points in any finite interval, then $F_X$ is a piecewise constant function.

A random variable $X$ is a continuous random variable if the CDF is the integral of a function:

$$F_X(x) = \int_{-\infty}^{x} f_X(y)\,dy.$$

The function $f_X$ is called the probability density function (pdf). If the pdf $f_X$ is continuous at a
point $x$, then the value $f_X(x)$ has the following nice interpretation:

$$f_X(x) = \lim_{\epsilon \to 0} \frac{1}{\epsilon} \int_{x}^{x+\epsilon} f_X(y)\,dy = \lim_{\epsilon \to 0} \frac{1}{\epsilon} P\{x \leq X \leq x + \epsilon\}.$$

If $A$ is any Borel subset of $\mathbb{R}$, then

$$P\{X \in A\} = \int_A f_X(x)\,dx. \qquad (1.6)$$

The integral in (1.6) can be understood as a Riemann integral if $A$ is a finite union of intervals and
$f_X$ is piecewise continuous or monotone. In general, $f_X$ is required to be Borel measurable and the
integral is defined by Lebesgue integration.²

Any random variable $X$ on an arbitrary probability space has a CDF $F_X$. As noted in the proof
of Proposition 1.3.1 there exists a probability measure $P_X$ (called $P$ in the proof) on the Borel
subsets of $\mathbb{R}$ such that for any interval $(a, b]$,

$$P_X((a, b]) = P\{X \in (a, b]\}.$$

We define the probability distribution of $X$ to be the probability measure $P_X$. The distribution $P_X$
is determined uniquely by the CDF $F_X$. The distribution is also determined by the pdf $f_X$ if $X$
is continuous type, or the pmf $p_X$ if $X$ is discrete type. In common usage, the
question "What is the distribution of X?" is answered by giving one or more of $F_X$, $f_X$, or $p_X$, or
possibly a transform of one of these, whichever is most convenient.
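For a discrete random variable, the relations between the pmf, the CDF, and the jumps of the CDF described in this section can be illustrated with the fair die mentioned earlier. The sketch below is a minimal illustration; the function names are assumptions, and the left limit $F_X(c-)$ is computed directly from the pmf rather than as a numerical limit.

```python
from fractions import Fraction

# pmf of the outcome X of a fair die roll.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}


def cdf(c):
    """F_X(c) = P{X <= c}: sum of the pmf over mass points y <= c, as in (1.5)."""
    return sum(p for y, p in pmf.items() if y <= c)


def cdf_left_limit(c):
    """F_X(c-) = P{X < c}: sum of the pmf over mass points y < c."""
    return sum(p for y, p in pmf.items() if y < c)


def jump(c):
    """Size of the jump of F_X at c, which equals P{X = c}."""
    return cdf(c) - cdf_left_limit(c)
```

The CDF is piecewise constant and right continuous: `cdf(3)` and `cdf(3.5)` agree, while the jump at 3 recovers the pmf value $1/6$.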

1.4 Functions of a random variable 

Recall that a random variable lona probability space (0, T, P) is a function mapping Q to the 
real line R , satisfying the condition {uj : X(oj) < a} £ T for all a £ R. Suppose g is a function 
mapping R to R that is not too bizarre. Specifically, suppose for any constant c that {x : g{x) < c} 
is a Borel subset of R. Let Y(u) = g(X(u>)). Then Y maps to R and Y is a random variable. 
See Figure 1.6. We write Y = g(X). 

Often we'd like to compute the distribution of Y from knowledge of g and the distribution of 
X. In case X is a continuous random variable with known distribution, the following three step 
procedure works well: 

(1) Examine the ranges of possible values of X and Y. Sketch the function g. 

(2) Find the CDF of Y, using F Y (c) = P{Y < c} = P{g{X) < c}. The idea is to express the 
event {g{X) < c} as {X € A} for some set A depending on c. 



2 Lebesgue integration is defined in Sections 1.5 and 11.5 



Figure 1.6: A function of a random variable as a composition of mappings. 

(3) If $F_Y$ has a piecewise continuous derivative, and if the pdf $f_Y$ is desired, differentiate $F_Y$.

If instead $X$ is a discrete random variable then step 1 should be followed. After that the pmf of $Y$
can be found from the pmf of $X$ using

\[ p_Y(y) = P\{g(X) = y\} = \sum_{x : g(x) = y} p_X(x) \]
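
The discrete formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `pmf_of_function` and the example pmf (X uniform on {-2, ..., 2} with g(x) = x^2) are hypothetical and do not come from the text.

```python
from collections import defaultdict
from fractions import Fraction


def pmf_of_function(pmf_x, g):
    """Return the pmf of Y = g(X), given the pmf of a discrete X as a dict."""
    pmf_y = defaultdict(Fraction)  # Fraction() is 0, so missing keys start at 0
    for x, p in pmf_x.items():
        pmf_y[g(x)] += p  # accumulate p_X(x) over all x with g(x) = y
    return dict(pmf_y)


# Hypothetical example: X uniform on {-2, -1, 0, 1, 2} and Y = X^2.
pmf_x = {x: Fraction(1, 5) for x in range(-2, 3)}
pmf_y = pmf_of_function(pmf_x, lambda x: x * x)
```

Exact fractions are used so that the masses of Y can be compared with hand calculations without round-off.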

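Example 1.4.1 Suppose $X$ is a $N(\mu = 2, \sigma^2 = 3)$ random variable (see Section 1.6 for the definition)
and $Y = X^2$. Let us describe the density of $Y$. Note that $Y = g(X)$ where $g(x) = x^2$. The
support of the distribution of $X$ is the whole real line, and the range of $g$ over this support is $\mathbb{R}_+$.
Next we find the CDF, $F_Y$. Since $P\{Y \ge 0\} = 1$, $F_Y(c) = 0$ for $c < 0$. For $c \ge 0$,

\[ F_Y(c) = P\{X^2 \le c\} = P\{-\sqrt{c} \le X \le \sqrt{c}\} \]
\[ = P\left\{ \frac{-\sqrt{c} - 2}{\sqrt{3}} \le \frac{X - 2}{\sqrt{3}} \le \frac{\sqrt{c} - 2}{\sqrt{3}} \right\} = \Phi\left(\frac{\sqrt{c} - 2}{\sqrt{3}}\right) - \Phi\left(\frac{-\sqrt{c} - 2}{\sqrt{3}}\right). \]

Differentiate with respect to $c$, using the chain rule and the fact $\Phi'(s) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{s^2}{2})$, to obtain

\[ f_Y(c) = \begin{cases} \frac{1}{\sqrt{24\pi c}} \left\{ \exp\left(-\left[\frac{\sqrt{c} - 2}{\sqrt{6}}\right]^2\right) + \exp\left(-\left[\frac{-\sqrt{c} - 2}{\sqrt{6}}\right]^2\right) \right\} & \text{if } c \ge 0 \\ 0 & \text{if } c < 0 \end{cases} \qquad (1.7) \]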
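
As a sanity check on steps (2) and (3), the following sketch compares the density (1.7) with a numerical derivative of the CDF of Y. It assumes the standard normal CDF is computed from `math.erf`; the function names are illustrative only, and the density is returned as 0 at c = 0, where the formula is singular.

```python
import math

MU, VAR = 2.0, 3.0  # parameters of X in Example 1.4.1


def Phi(s):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(s / math.sqrt(2.0)))


def cdf_y(c):
    """F_Y(c) = Phi((sqrt(c) - mu)/sigma) - Phi((-sqrt(c) - mu)/sigma) for c >= 0."""
    if c < 0:
        return 0.0
    r, sd = math.sqrt(c), math.sqrt(VAR)
    return Phi((r - MU) / sd) - Phi((-r - MU) / sd)


def pdf_y(c):
    """Density (1.7); with VAR = 3 the prefactor is 1/sqrt(24 pi c)."""
    if c <= 0:
        return 0.0
    r = math.sqrt(c)
    a = math.exp(-(((r - MU) / math.sqrt(2.0 * VAR)) ** 2))
    b = math.exp(-(((-r - MU) / math.sqrt(2.0 * VAR)) ** 2))
    return (a + b) / math.sqrt(8.0 * math.pi * VAR * c)


def numeric_derivative(c, h=1e-5):
    """Central-difference approximation of dF_Y/dc."""
    return (cdf_y(c + h) - cdf_y(c - h)) / (2.0 * h)
```

Calling `pdf_y(c)` and `numeric_derivative(c)` at a few positive values of c should give nearly identical numbers.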



Example 1.4.2 Suppose a vehicle is traveling in a straight line at speed $a$, and that a random
direction is selected, subtending an angle $\Theta$ from the direction of travel which is uniformly
distributed over the interval $[0, \pi]$. See Figure 1.7. Then the effective speed of the vehicle in the
random direction is $B = a\cos(\Theta)$. Let us find the pdf of $B$.

Figure 1.7: Direction of travel and a random direction.

The range of $a\cos(\theta)$ as $\theta$ ranges over $[0, \pi]$ is the interval $[-a, a]$. Therefore, $F_B(c) = 0$ for
$c \le -a$ and $F_B(c) = 1$ for $c \ge a$. Let now $-a < c < a$. Then, because cos is monotone nonincreasing
on the interval $[0, \pi]$,

\[ F_B(c) = P\{a\cos(\Theta) \le c\} = P\left\{\cos(\Theta) \le \frac{c}{a}\right\} = P\left\{\Theta \ge \cos^{-1}\left(\frac{c}{a}\right)\right\} = 1 - \frac{\cos^{-1}(c/a)}{\pi}. \]

Therefore, because $\cos^{-1}(y)$ has derivative $-(1 - y^2)^{-1/2}$,

\[ f_B(c) = \begin{cases} \frac{1}{\pi\sqrt{a^2 - c^2}} & |c| < a \\ 0 & |c| > a \end{cases} \]

A sketch of the density is given in Figure 1.8.

Figure 1.8: The pdf of the effective speed in a uniformly distributed direction.



Example 1.4.3 Suppose $Y = \tan(\Theta)$, as illustrated in Figure 1.9, where $\Theta$ is uniformly distributed
over the interval $(-\frac{\pi}{2}, \frac{\pi}{2})$. Let us find the pdf of $Y$. The function $\tan(\theta)$ increases from $-\infty$ to $\infty$
as $\theta$ ranges over the interval $(-\frac{\pi}{2}, \frac{\pi}{2})$. For any real $c$,

\[ F_Y(c) = P\{Y \le c\} = P\{\tan(\Theta) \le c\} = P\{\Theta \le \tan^{-1}(c)\} = \frac{\tan^{-1}(c) + \frac{\pi}{2}}{\pi}. \]

Differentiating the CDF with respect to $c$ yields that $Y$ has the Cauchy pdf:

\[ f_Y(c) = \frac{1}{\pi(1 + c^2)}, \qquad -\infty < c < \infty. \]

Figure 1.9: A horizontal line, a fixed point at unit distance, and a line through the point with
random direction.



Example 1.4.4 Given an angle $\theta$ expressed in radians, let $(\theta \bmod 2\pi)$ denote the equivalent angle
in the interval $[0, 2\pi]$. Thus, $(\theta \bmod 2\pi)$ is equal to $\theta + 2\pi n$, where the integer $n$ is such that
$0 \le \theta + 2\pi n < 2\pi$.

Let $\Theta$ be uniformly distributed over $[0, 2\pi]$, let $h$ be a constant, and let

\[ \tilde{\Theta} = (\Theta + h \bmod 2\pi). \]

Let us find the distribution of $\tilde{\Theta}$.

Clearly $\tilde{\Theta}$ takes values in the interval $[0, 2\pi]$, so fix $c$ with $0 \le c < 2\pi$ and seek to find
$P\{\tilde{\Theta} \le c\}$. Let $A$ denote the interval $[h, h + 2\pi]$. Thus, $\Theta + h$ is uniformly distributed over $A$. Let
$B = \bigcup_n [2\pi n, 2\pi n + c]$. Thus $\tilde{\Theta} \le c$ if and only if $\Theta + h \in B$. Therefore,

\[ P\{\tilde{\Theta} \le c\} = \int_{A \cap B} \frac{1}{2\pi}\,d\theta. \]

By sketching the set $B$, it is easy to see that $A \cap B$ is either a single interval of length $c$, or the
union of two intervals with lengths adding to $c$. Therefore, $P\{\tilde{\Theta} \le c\} = \frac{c}{2\pi}$, so that $\tilde{\Theta}$ is itself
uniformly distributed over $[0, 2\pi]$.






Example 1.4.5 Let $X$ be an exponentially distributed random variable with parameter $\lambda$. Let
$Y = \lfloor X \rfloor$, which is the integer part of $X$, and let $R = X - \lfloor X \rfloor$, which is the remainder. We shall
describe the distributions of $Y$ and $R$.

Clearly $Y$ is a discrete random variable with possible values 0, 1, 2, ..., so it is sufficient to find
the pmf of $Y$. For integers $k \ge 0$,

\[ p_Y(k) = P\{k \le X < k + 1\} = \int_k^{k+1} \lambda e^{-\lambda x}\,dx = e^{-\lambda k}(1 - e^{-\lambda}) \]

and $p_Y(k) = 0$ for other $k$.

Turn next to the distribution of $R$. Clearly $R$ takes values in the interval $[0, 1]$. So let $0 < c < 1$
and find $F_R(c)$:

\[ F_R(c) = P\{X - \lfloor X \rfloor \le c\} = P\left\{X \in \bigcup_{k=0}^{\infty} [k, k + c]\right\} = \sum_{k=0}^{\infty} P\{k \le X \le k + c\} = \sum_{k=0}^{\infty} e^{-\lambda k}(1 - e^{-\lambda c}) = \frac{1 - e^{-\lambda c}}{1 - e^{-\lambda}}, \]

where we used the fact $1 + \alpha + \alpha^2 + \cdots = \frac{1}{1 - \alpha}$ for $|\alpha| < 1$. Differentiating $F_R$ yields the pdf:

\[ f_R(c) = \begin{cases} \frac{\lambda e^{-\lambda c}}{1 - e^{-\lambda}} & 0 \le c \le 1 \\ 0 & \text{otherwise} \end{cases} \]

What happens to the density of $R$ as $\lambda \to 0$ or as $\lambda \to \infty$? By l'Hospital's rule,

\[ \lim_{\lambda \to 0} f_R(c) = \begin{cases} 1 & 0 \le c \le 1 \\ 0 & \text{otherwise} \end{cases} \]

That is, in the limit as $\lambda \to 0$, the density of $X$ becomes more and more "evenly spread out," and
$R$ becomes uniformly distributed over the interval $[0, 1]$. If $\lambda$ is very large then the factor $1 - e^{-\lambda}$ is
nearly one, and the density of $R$ is nearly the same as the exponential density with parameter $\lambda$.
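
The closed forms in Example 1.4.5 can be checked numerically. The sketch below, with illustrative function names and a truncated series (an assumption made only for the check), compares the formula for the CDF of R with a direct sum of P{k <= X <= k + c} over k.

```python
import math


def pmf_floor(k, lam):
    """p_Y(k) = e^{-lam k} (1 - e^{-lam}) for integers k >= 0, and 0 otherwise."""
    return math.exp(-lam * k) * (1.0 - math.exp(-lam)) if k >= 0 else 0.0


def cdf_remainder(c, lam):
    """F_R(c) = (1 - e^{-lam c}) / (1 - e^{-lam}) for 0 <= c <= 1."""
    return (1.0 - math.exp(-lam * c)) / (1.0 - math.exp(-lam))


def cdf_remainder_by_sum(c, lam, terms=2000):
    """Sum P{k <= X <= k + c} = e^{-lam k} - e^{-lam (k + c)} over k, truncated."""
    return sum(math.exp(-lam * k) - math.exp(-lam * (k + c)) for k in range(terms))
```

For small values of the rate, `cdf_remainder(c, lam)` is close to c, in line with the remark that R becomes uniform on [0, 1] as the rate tends to zero.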

An important step in many computer simulations of random systems is to generate a random
variable with a specified CDF, by applying a function to a random variable that is uniformly
distributed on the interval $[0, 1]$. Let $F$ be a function satisfying the three properties required of a
CDF, and let $U$ be uniformly distributed over the interval $[0, 1]$. The problem is to find a function
$g$ so that $F$ is the CDF of $g(U)$. An appropriate function $g$ is given by the inverse function of
$F$. Although $F$ may not be strictly increasing, a suitable version of $F^{-1}$ always exists, defined for
$0 < u < 1$ by

\[ F^{-1}(u) = \min\{x : F(x) \ge u\} \qquad (1.8) \]




Figure 1.10: A CDF and its inverse. 

If the graphs of $F$ and $F^{-1}$ are closed up by adding vertical lines at jump points, then the graphs
are reflections of each other about the $x = y$ line, as illustrated in Figure 1.10. It is not hard to
check that for any real $x$ and $u$ with $0 < u < 1$,

\[ F(x) \ge u \text{ if and only if } x \ge F^{-1}(u). \]

Thus, if $X = F^{-1}(U)$ then

\[ P\{F^{-1}(U) \le x\} = P\{U \le F(x)\} = F(x), \]

so that indeed $F$ is the CDF of $X$.

Example 1.4.6 Suppose $F(x) = 1 - e^{-x}$ for $x \ge 0$ and $F(x) = 0$ for $x < 0$. Since $F$ is continuously
increasing in this case, we can identify its inverse by solving for $x$ as a function of $u$ so that $F(x) = u$.
That is, for $0 < u < 1$, we'd like $1 - e^{-x} = u$ which is equivalent to $e^{-x} = 1 - u$, or $x = -\ln(1 - u)$.
Thus, $F^{-1}(u) = -\ln(1 - u)$. So we can take $g(u) = -\ln(1 - u)$ for $0 < u < 1$. That is, if $U$ is
uniformly distributed on the interval $[0, 1]$, then the CDF of $-\ln(1 - U)$ is $F$. The choice of $g$ is
not unique in general. For example, $1 - U$ has the same distribution as $U$, so the CDF of $-\ln(U)$
is also $F$. To double check the answer, note that if $x \ge 0$, then

\[ P\{-\ln(1 - U) \le x\} = P\{\ln(1 - U) \ge -x\} = P\{1 - U \ge e^{-x}\} = P\{U \le 1 - e^{-x}\} = F(x). \]
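
A minimal simulation sketch of Example 1.4.6 follows. It assumes Python's standard `random.Random` generator as the source of the uniform variable; the seed, sample size, and variable names are arbitrary choices for illustration.

```python
import math
import random


def sample_exponential(rng):
    """Return -ln(1 - U) for U uniform on [0, 1)."""
    u = rng.random()           # u is in [0, 1), so 1 - u is in (0, 1] and log is finite
    return -math.log(1.0 - u)


rng = random.Random(0)         # fixed seed so the run is repeatable
samples = [sample_exponential(rng) for _ in range(100_000)]

# Empirical estimate of F(1) = 1 - e^{-1}.
empirical_cdf_at_1 = sum(s <= 1.0 for s in samples) / len(samples)
```

Using `1 - u` rather than `u` inside the logarithm avoids evaluating `log(0)`, since `random()` can return 0.0 but never 1.0.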



Example 1.4.7 Suppose $F$ is the CDF for the experiment of rolling a fair die, shown on the left
half of Figure 1.4. One way to generate a random variable with CDF $F$ is to actually roll a die.
To simulate that on a computer, we'd seek a function $g$ so that $g(U)$ has the same CDF. Using
$g = F^{-1}$ and using (1.8) or the graphical method illustrated in Figure 1.10 to find $F^{-1}$, we get
that for $0 < u < 1$, $g(u) = i$ for $\frac{i-1}{6} < u \le \frac{i}{6}$ for $1 \le i \le 6$. To double check the answer, note that
if $1 \le i \le 6$, then

\[ P\{g(U) = i\} = P\left\{\frac{i-1}{6} < U \le \frac{i}{6}\right\} = \frac{1}{6}, \]

so that $g(U)$ has the correct pmf, and hence the correct CDF.
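
The generalized inverse (1.8) is straightforward to evaluate for a discrete distribution. The sketch below is one possible implementation, assuming the pmf is given on a sorted list of values; the helper names are hypothetical. For the fair die it agrees with the shortcut g(u) = ceil(6u).

```python
import math
from fractions import Fraction


def inverse_cdf_discrete(values, probs, u):
    """F^{-1}(u) = min{x : F(x) >= u} for a pmf on sorted values, with 0 < u < 1."""
    cumulative = 0
    for x, p in zip(values, probs):
        cumulative += p          # cumulative is F(x) at the current value x
        if cumulative >= u:
            return x
    return values[-1]            # guards against round-off when probs are floats


def die_from_uniform(u):
    """Shortcut for the fair die: g(u) = i when (i - 1)/6 < u <= i/6."""
    return math.ceil(6 * u)


faces = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6
```

For example, `inverse_cdf_discrete(faces, probs, Fraction(1, 6))` lands exactly on the jump of F at 1, so the minimum in (1.8) is attained at face 1.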
1.5 Expectation of a random variable

The expectation, alternatively called the mean, of a random variable $X$ can be defined in several
different ways. Before giving a general definition, we shall consider a straightforward case. A
random variable $X$ is called simple if there is a finite set $\{x_1, \ldots, x_m\}$ such that $X(\omega) \in \{x_1, \ldots, x_m\}$
for all $\omega$. The expectation of such a random variable is defined by

\[ E[X] = \sum_{i=1}^{m} x_i P\{X = x_i\} \qquad (1.9) \]

The definition (1.9) clearly shows that $E[X]$ for a simple random variable $X$ depends only on the
pmf of $X$.

Like all random variables, $X$ is a function on a probability space $(\Omega, \mathcal{F}, P)$. Figure 1.11
illustrates that the sum defining $E[X]$ in (1.9) can be viewed as an integral over $\Omega$. This suggests
writing

\[ E[X] = \int_{\Omega} X(\omega) P(d\omega) \qquad (1.10) \]

Figure 1.11: A simple random variable with three possible values.



Let $Y$ be another simple random variable on the same probability space as $X$, with $Y(\omega) \in
\{y_1, \ldots, y_n\}$ for all $\omega$. Of course $E[Y] = \sum_{i=1}^{n} y_i P\{Y = y_i\}$. One learns in any elementary
probability class that $E[X + Y] = E[X] + E[Y]$. Note that $X + Y$ is again a simple random
variable, so that $E[X + Y]$ can be defined in the same way as $E[X]$ was defined. How would you
prove $E[X + Y] = E[X] + E[Y]$? Is (1.9) helpful? We shall give a proof that $E[X + Y] = E[X] + E[Y]$
motivated by (1.10).

The sets $\{X = x_1\}, \ldots, \{X = x_m\}$ form a partition of $\Omega$. A refinement of this partition consists
of another partition $C_1, \ldots, C_{m'}$ such that $X$ is constant over each $C_j$. If we let $x'_j$ denote the value
of $X$ on $C_j$, then clearly

\[ E[X] = \sum_j x'_j P(C_j). \]

Now, it is possible to select the partition $C_1, \ldots, C_{m'}$ so that both $X$ and $Y$ are constant over each
$C_j$. For example, each $C_j$ could have the form $\{X = x_i\} \cap \{Y = y_k\}$ for some $i, k$. Let $y'_j$ denote
the value of $Y$ on $C_j$. Then $x'_j + y'_j$ is the value of $X + Y$ on $C_j$. Therefore,

\[ E[X + Y] = \sum_j (x'_j + y'_j) P(C_j) = \sum_j x'_j P(C_j) + \sum_j y'_j P(C_j) = E[X] + E[Y]. \]
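
The refinement argument can be traced on a small example. The sketch below assumes a hypothetical probability space with six equally likely outcomes and two made-up simple random variables; none of these specifics come from the text.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite probability space: six equally likely outcomes.
omega = range(6)
P = {w: Fraction(1, 6) for w in omega}
X = {0: 1, 1: 1, 2: 2, 3: 2, 4: 3, 5: 3}   # a simple random variable
Y = {0: 0, 1: 5, 2: 0, 3: 5, 4: 0, 5: 5}   # another simple random variable


def expectation_by_values(Z):
    """Definition (1.9): sum of z * P{Z = z} over the distinct values z of Z."""
    return sum(z * sum(P[w] for w in omega if Z[w] == z) for z in set(Z.values()))


def expectation_over_partition(Z, cells):
    """Sum of (value of Z on the cell) * P(cell); Z must be constant on each cell."""
    return sum(Z[next(iter(c))] * sum(P[w] for w in c) for c in cells)


# Common refinement: nonempty cells of the form {X = x_i} intersected with {Y = y_k}.
cells = []
for x, y in product(set(X.values()), set(Y.values())):
    cell = frozenset(w for w in omega if X[w] == x and Y[w] == y)
    if cell:
        cells.append(cell)

S = {w: X[w] + Y[w] for w in omega}        # the sum X + Y, outcome by outcome
```

Because X, Y, and X + Y are all constant on each cell of the common refinement, the three expectations are sums over the same cells, which is what makes the additivity immediate.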



While the expression (1.10) is rather suggestive, it would be overly restrictive to interpret it
as a Riemann integral over $\Omega$. For example, if $X$ is a random variable for the standard unit-interval
probability space defined in Example 1.1.2, then it is tempting to define $E[X]$ by Riemann
integration (see the appendix):

\[ E[X] = \int_0^1 X(\omega)\,d\omega \qquad (1.11) \]

However, suppose $X$ is the simple random variable such that $X(\omega) = 1$ for rational values of $\omega$ and
$X(\omega) = 0$ otherwise. Since the set of rational numbers in $\Omega$ is countably infinite, such $X$ satisfies
$P\{X = 0\} = 1$. Clearly we'd like $E[X] = 0$, but the Riemann integral (1.11) is not convergent for
this choice of $X$.

The expression (1.10) can be used to define $E[X]$ in great generality if it is interpreted as a
Lebesgue integral, defined as follows: Suppose $X$ is an arbitrary nonnegative random variable.
Then there exists a sequence of simple random variables $X_1, X_2, \ldots$ such that for every $\omega \in \Omega$,
$X_1(\omega) \le X_2(\omega) \le \cdots$ and $X_n(\omega) \to X(\omega)$ as $n \to \infty$. Then $E[X_n]$ is well defined for each $n$ and
is nondecreasing in $n$, so the limit of $E[X_n]$ as $n \to \infty$ exists with values in $[0, +\infty]$. Furthermore
it can be shown that the value of the limit depends only on $(\Omega, \mathcal{F}, P)$ and $X$, not on the particular
choice of the approximating simple sequence. We thus define $E[X] = \lim_{n \to \infty} E[X_n]$. Thus, $E[X]$
is always well defined in this way, with possible value $+\infty$, if $X$ is a nonnegative random variable.

Suppose $X$ is an arbitrary random variable. Define the positive part of $X$ to be the random
variable $X_+$ defined by $X_+(\omega) = \max\{0, X(\omega)\}$ for each value of $\omega$. Similarly define the negative
part of $X$ to be the random variable $X_-(\omega) = \max\{0, -X(\omega)\}$. Then $X(\omega) = X_+(\omega) - X_-(\omega)$
for all $\omega$, and $X_+$ and $X_-$ are both nonnegative random variables. As long as at least one of
$E[X_+]$ or $E[X_-]$ is finite, define $E[X] = E[X_+] - E[X_-]$. The expectation $E[X]$ is undefined
if $E[X_+] = E[X_-] = +\infty$. This completes the definition of $E[X]$ using (1.10) interpreted as a
Lebesgue integral.

We will prove that $E[X]$ defined by the Lebesgue integral (1.10) depends only on the CDF of
$X$. It suffices to show this for a nonnegative random variable $X$. For such a random variable, and
$n \ge 1$, define the simple random variable $X_n$ by

\[ X_n(\omega) = \begin{cases} k2^{-n} & \text{if } k2^{-n} < X(\omega) \le (k+1)2^{-n},\ k = 0, 1, \ldots, 2^{2n} - 1 \\ 0 & \text{else} \end{cases} \]

Then

\[ E[X_n] = \sum_{k=0}^{2^{2n}-1} k2^{-n} \left( F_X((k+1)2^{-n}) - F_X(k2^{-n}) \right) \]

so that $E[X_n]$ is determined by the CDF $F_X$ for each $n$. Furthermore, the $X_n$'s are nondecreasing
in $n$ and converge to $X$. Thus, $E[X] = \lim_{n \to \infty} E[X_n]$, and therefore the limit $E[X]$ is determined
by $F_X$.
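
To see the dyadic approximation at work, the following sketch evaluates the sum for E[X_n] directly from a CDF. The choice of the Exp(1) CDF as a test case and the function names are assumptions made for illustration; the mean of that distribution is 1.

```python
import math


def expectation_of_dyadic_approx(cdf, n):
    """E[X_n] = sum over k = 0..2^{2n}-1 of k 2^{-n} (F((k+1) 2^{-n}) - F(k 2^{-n}))."""
    step = 2.0 ** (-n)
    return sum(k * step * (cdf((k + 1) * step) - cdf(k * step))
               for k in range(2 ** (2 * n)))


def exp_cdf(x, lam=1.0):
    """CDF of the exponential distribution with parameter lam."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0


# E[X_1], ..., E[X_8] for an Exp(1) random variable.
approximations = [expectation_of_dyadic_approx(exp_cdf, n) for n in range(1, 9)]
```

The list should increase with n and approach 1, illustrating that the limit is determined by the CDF alone.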

In Section 1.3 we defined the probability distribution $P_X$ of a random variable such that the
canonical random variable $\tilde{X}(\omega) = \omega$ on $(\mathbb{R}, \mathcal{B}, P_X)$ has the same CDF as $X$. Therefore $E[X] =
E[\tilde{X}]$, or

\[ E[X] = \int_{-\infty}^{\infty} x\,P_X(dx) \quad \text{(Lebesgue)} \qquad (1.12) \]

By definition, the integral (1.12) is the Lebesgue-Stieltjes integral of $x$ with respect to $F_X$, so that

\[ E[X] = \int_{-\infty}^{\infty} x\,dF_X(x) \quad \text{(Lebesgue-Stieltjes)} \qquad (1.13) \]

Expectation has the following properties. Let $X$, $Y$ be random variables and $c$ a constant.

E.1 (Linearity) $E[cX] = cE[X]$. If $E[X]$, $E[Y]$ and $E[X] + E[Y]$ are well defined, then
$E[X + Y]$ is well defined and $E[X + Y] = E[X] + E[Y]$.

E.2 (Preservation of order) If $P\{X \ge Y\} = 1$ and $E[Y]$ is well defined then $E[X]$ is well
defined and $E[X] \ge E[Y]$.

E.3 If $X$ has pdf $f_X$ then

\[ E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx \quad \text{(Lebesgue)} \]

E.4 If $X$ has pmf $p_X$ then

\[ E[X] = \sum_{x > 0} x p_X(x) + \sum_{x < 0} x p_X(x). \]

E.5 (Law of the unconscious statistician (LOTUS)) If $g$ is Borel measurable,

\[ E[g(X)] = \int_{\Omega} g(X(\omega)) P(d\omega) \quad \text{(Lebesgue)} \]
\[ = \int_{-\infty}^{\infty} g(x)\,dF_X(x) \quad \text{(Lebesgue-Stieltjes)} \]

and in case $X$ is a continuous type random variable

\[ E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx \quad \text{(Lebesgue)} \]
E.6 (Integration by parts formula)

\[ E[X] = \int_0^{\infty} (1 - F_X(x))\,dx - \int_{-\infty}^{0} F_X(x)\,dx, \qquad (1.14) \]

which is well defined whenever at least one of the two integrals in (1.14) is finite. There is
a simple graphical interpretation of (1.14). Namely, $E[X]$ is equal to the area of the region
between the horizontal line $\{y = 1\}$ and the graph of $F_X$ and contained in $\{x \ge 0\}$, minus
the area of the region bounded by the $x$ axis and the graph of $F_X$ and contained in $\{x \le 0\}$,
as long as at least one of these regions has finite area. See Figure 1.12.




Figure 1.12: E[X] is the difference of two areas. 



Properties E.1 and E.2 are true for simple random variables and they carry over to general random
variables in the limit defining the Lebesgue integral (1.10). Properties E.3 and E.4 follow from
the equivalent definition (1.12) and properties of Lebesgue-Stieltjes integrals. Property E.5 can
be proved by approximating $g$ by piecewise constant functions. Property E.6 can be proved by
integration by parts applied to (1.13). Alternatively, since $F_X^{-1}(U)$ has the same distribution as
$X$, if $U$ is uniformly distributed on the interval $[0, 1]$, the law of the unconscious statistician yields
that $E[X] = \int_0^1 F_X^{-1}(u)\,du$, and this integral can also be interpreted as the difference of the areas
of the same two regions.
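
The area interpretation of (1.14) lends itself to a quick numerical check. The sketch below assumes a N(2, 3) random variable as a test case and truncates both integrals to a finite range; the function names, the truncation limits, and the step count are illustrative choices, not part of the text.

```python
import math


def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def mean_from_cdf(cdf, lo=-50.0, hi=50.0, steps=100_000):
    """Midpoint-rule version of (1.14), with the integrals truncated to [lo, hi]."""
    dx_pos = hi / steps
    dx_neg = -lo / steps
    # Area between y = 1 and the CDF over [0, hi].
    upper = sum(1.0 - cdf((i + 0.5) * dx_pos) for i in range(steps)) * dx_pos
    # Area under the CDF over [lo, 0].
    lower = sum(cdf(-(i + 0.5) * dx_neg) for i in range(steps)) * dx_neg
    return upper - lower


estimate = mean_from_cdf(lambda x: normal_cdf(x, mu=2.0, sigma=math.sqrt(3.0)))
```

The estimate should be very close to the mean 2, the difference of the two areas in Figure 1.12.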

Sometimes, for brevity, we write $EX$ instead of $E[X]$. The variance of a random variable $X$ with
$EX$ finite is defined by $\mathrm{Var}(X) = E[(X - EX)^2]$. By the linearity of expectation, if $EX$ is finite, the
variance of $X$ satisfies the useful relation: $\mathrm{Var}(X) = E[X^2 - 2X(EX) + (EX)^2] = E[X^2] - (EX)^2$.

The following two inequalities are simple and fundamental. The Markov inequality states that
if $Y$ is a nonnegative random variable, then for $c > 0$,

\[ P\{Y \ge c\} \le \frac{E[Y]}{c}. \]

To prove Markov's inequality, note that $I_{\{Y \ge c\}} \le \frac{Y}{c}$, and take expectations on each side. The
Chebychev inequality states that if $X$ is a random variable with finite mean $\mu$ and variance $\sigma^2$, then
for any $d > 0$,

\[ P\{|X - \mu| \ge d\} \le \frac{\sigma^2}{d^2}. \]

The Chebychev inequality follows by applying the Markov inequality with $Y = |X - \mu|^2$ and $c = d^2$.
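
Both inequalities can be compared against exact probabilities for a distribution with known tails. The sketch below assumes an exponential random variable (mean and variance both equal to 1 when the rate is 1) as the test case; the function names are hypothetical.

```python
import math


def exp_tail(c, lam=1.0):
    """P{Y >= c} for Y exponentially distributed with parameter lam."""
    return math.exp(-lam * c) if c > 0 else 1.0


def markov_bound(mean, c):
    """Right-hand side of the Markov inequality, E[Y] / c."""
    return mean / c


def chebychev_bound(variance, d):
    """Right-hand side of the Chebychev inequality, sigma^2 / d^2."""
    return variance / d ** 2


def exp_deviation_prob(d, lam=1.0):
    """Exact P{|X - 1/lam| >= d} for X exponentially distributed with parameter lam."""
    mu = 1.0 / lam
    upper = exp_tail(mu + d, lam)                            # P{X >= mu + d}
    lower = 1.0 - exp_tail(mu - d, lam) if d < mu else 0.0   # P{X <= mu - d}
    return upper + lower
```

Comparing, say, `exp_tail(2.0)` with `markov_bound(1.0, 2.0)` shows how loose the bounds can be: they hold for every distribution with the given moments, so they cannot be tight for any particular one.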

The characteristic function $\Phi_X$ of a random variable $X$ is defined by

\[ \Phi_X(u) = E[e^{juX}] \]

for real values of $u$, where $j = \sqrt{-1}$. For example, if $X$ has pdf $f_X$, then

\[ \Phi_X(u) = \int_{-\infty}^{\infty} \exp(jux) f_X(x)\,dx, \]

which is $2\pi$ times the inverse Fourier transform of $f_X$.

Two random variables have the same probability distribution if and only if they have the same
characteristic function. If $E[X^k]$ exists and is finite for an integer $k \ge 1$, then the derivatives of
$\Phi_X$ up to order $k$ exist and are continuous, and

\[ \Phi_X^{(k)}(0) = j^k E[X^k]. \]

For a nonnegative integer-valued random variable $X$ it is often more convenient to work with the
z transform of the pmf, defined by

\[ \Psi_X(z) = E[z^X] = \sum_{k=0}^{\infty} z^k p_X(k) \]

for real or complex $z$ with $|z| \le 1$. Two such random variables have the same probability
distribution if and only if their z transforms are equal. If $E[X^k]$ is finite it can be found from the
derivatives of $\Psi_X$ up to the $k$th order at $z = 1$,

\[ \Psi_X^{(k)}(1) = E[X(X-1) \cdots (X-k+1)]. \]

1.6 Frequently used distributions 

The following is a list of the most basic and frequently used probability distributions. For each 
distribution an abbreviation, if any, and valid parameter values are given, followed by either the 
CDF, pdf or pmf, then the mean, variance, a typical example and significance of the distribution. 
The constants $p$, $\lambda$, $\mu$, $\sigma$, $a$, $b$, and $\alpha$ are real-valued, and $n$ and $i$ are integer-valued, except $n$
can be noninteger-valued in the case of the gamma distribution.

Bernoulli: Be(p), $0 \le p \le 1$

pmf: $p(i) = \begin{cases} p & i = 1 \\ 1 - p & i = 0 \\ 0 & \text{else} \end{cases}$
z-transform: $1 - p + pz$
mean: $p$    variance: $p(1 - p)$

Example: Number of heads appearing in one flip of a coin. The coin is called fair if $p = \frac{1}{2}$ and
biased otherwise.
Binomial: Bi(n,p), $n \ge 1$, $0 \le p \le 1$

pmf: $p(i) = \binom{n}{i} p^i (1 - p)^{n-i}$, $0 \le i \le n$
z-transform: $(1 - p + pz)^n$
mean: $np$    variance: $np(1 - p)$



Example: Number of heads appearing in n independent flips of a coin. 



Poisson: Poi($\lambda$), $\lambda \ge 0$

pmf: $p(i) = \frac{\lambda^i e^{-\lambda}}{i!}$, $i \ge 0$
z-transform: $\exp(\lambda(z - 1))$
mean: $\lambda$    variance: $\lambda$

Example: Number of phone calls placed during a ten second interval in a large city.

Significance: The Poisson pmf is the limit of the binomial pmf as $n \to +\infty$ and $p \to 0$ in such a
way that $np \to \lambda$.
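
The binomial-to-Poisson limit can be observed numerically. The sketch below fixes np = 3 and lets n grow; the parameter values, the range of i inspected, and the function names are assumptions chosen for illustration.

```python
import math


def binomial_pmf(i, n, p):
    """pmf of Bi(n, p) at i."""
    return math.comb(n, i) * p ** i * (1.0 - p) ** (n - i)


def poisson_pmf(i, lam):
    """pmf of Poi(lam) at i."""
    return lam ** i * math.exp(-lam) / math.factorial(i)


def max_gap(n, lam, upto=20):
    """Largest |Bi(n, lam/n) pmf - Poi(lam) pmf| over i = 0, ..., upto."""
    p = lam / n
    return max(abs(binomial_pmf(i, n, p) - poisson_pmf(i, lam))
               for i in range(upto + 1))


# The gap shrinks as n grows with np held fixed at 3.
gaps = [max_gap(n, lam=3.0) for n in (30, 300, 3000)]
```

Each tenfold increase in n should reduce the largest pointwise gap by roughly a factor of ten.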



Geometric: Geo(p), $0 < p \le 1$

pmf: $p(i) = (1 - p)^{i-1} p$, $i \ge 1$
z-transform: $\frac{pz}{1 - z + pz}$
mean: $\frac{1}{p}$    variance: $\frac{1 - p}{p^2}$

Example: Number of independent flips of a coin until heads first appears.

Significant property: If $X$ has the geometric distribution, $P\{X > i\} = (1 - p)^i$ for integers $i \ge 1$.
So $X$ has the memoryless property:

\[ P\{X > i + j \mid X > i\} = P\{X > j\} \text{ for } i, j \ge 1. \]

Any positive integer-valued random variable with this property has a geometric distribution.
Gaussian (also called Normal): $N(\mu, \sigma^2)$, $\mu \in \mathbb{R}$, $\sigma \ge 0$

pdf (if $\sigma^2 > 0$): $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
pmf (if $\sigma^2 = 0$): $p(x) = \begin{cases} 1 & x = \mu \\ 0 & \text{else} \end{cases}$
characteristic function: $\exp\left(ju\mu - \frac{u^2\sigma^2}{2}\right)$
mean: $\mu$    variance: $\sigma^2$

Example: Instantaneous voltage difference (due to thermal noise) measured across a resistor held
at a fixed temperature.

Notation: The character $\Phi$ is often used to denote the CDF of a $N(0, 1)$ random variable,^3 and $Q$
is often used for the complementary CDF:

\[ Q(c) = 1 - \Phi(c) = \int_c^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\,dx \]

Significant property (Central limit theorem): If $X_1, X_2, \ldots$ are independent and identically
distributed with mean $\mu$ and nonzero variance $\sigma^2$, then for any constant $c$,

\[ \lim_{n \to \infty} P\left\{ \frac{X_1 + \cdots + X_n - n\mu}{\sqrt{n\sigma^2}} \le c \right\} = \Phi(c) \]



Exponential: Exp($\lambda$), $\lambda > 0$

pdf: $f(x) = \lambda e^{-\lambda x}$, $x \ge 0$
characteristic function: $\frac{\lambda}{\lambda - ju}$
mean: $\frac{1}{\lambda}$    variance: $\frac{1}{\lambda^2}$

Example: Time elapsed between noon sharp and the first telephone call placed in a large city, on
a given day.

Significance: If $X$ has the Exp($\lambda$) distribution, $P\{X \ge t\} = e^{-\lambda t}$ for $t \ge 0$. So $X$ has the
memoryless property:

\[ P\{X \ge s + t \mid X \ge s\} = P\{X \ge t\}, \qquad s, t \ge 0. \]

Any nonnegative random variable with this property is exponentially distributed.



3 As noted earlier, $\Phi$ is also used to denote characteristic functions. The meaning should be clear from the context.
Uniform: U(a, b), $-\infty < a < b < \infty$

pdf: $f(x) = \begin{cases} \frac{1}{b - a} & a \le x \le b \\ 0 & \text{else} \end{cases}$
characteristic function: $\frac{e^{jub} - e^{jua}}{ju(b - a)}$
mean: $\frac{a + b}{2}$    variance: $\frac{(b - a)^2}{12}$

Example: The phase difference between two independent oscillators operating at the same
frequency may be modeled as uniformly distributed over $[0, 2\pi]$.

Significance: Uniform is uniform.
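
Gamma($n, \alpha$): $n, \alpha > 0$ ($n$ real valued)

pdf: $f(x) = \frac{\alpha^n x^{n-1} e^{-\alpha x}}{\Gamma(n)}$, $x \ge 0$, where $\Gamma(n) = \int_0^{\infty} s^{n-1} e^{-s}\,ds$
characteristic function: $\left(\frac{\alpha}{\alpha - ju}\right)^n$
mean: $\frac{n}{\alpha}$    variance: $\frac{n}{\alpha^2}$

Significance: If $n$ is a positive integer then $\Gamma(n) = (n - 1)!$ and a Gamma($n, \alpha$) random variable
has the same distribution as the sum of $n$ independent, Exp($\alpha$) distributed random variables.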



Rayleigh($\sigma^2$): $\sigma^2 > 0$

pdf: $f(r) = \frac{r}{\sigma^2} \exp\left(-\frac{r^2}{2\sigma^2}\right)$, $r > 0$
CDF: $1 - \exp\left(-\frac{r^2}{2\sigma^2}\right)$
mean: $\sigma\sqrt{\frac{\pi}{2}}$    variance: $\sigma^2\left(2 - \frac{\pi}{2}\right)$

Example: Instantaneous value of the envelope of a mean zero, narrow band noise signal.

Significance: If $X$ and $Y$ are independent, $N(0, \sigma^2)$ random variables, then $(X^2 + Y^2)^{1/2}$ has the
Rayleigh($\sigma^2$) distribution. Also notable is the simple form of the CDF.
1.7 Failure rate functions

Eventually a system or a component of a particular system will fail. Let $T$ be a random variable
that denotes the lifetime of this item. Suppose $T$ is a positive random variable with pdf $f_T$. The
failure rate function, $h = (h(t) : t \ge 0)$, of $T$ (and of the item itself) is defined by the following
limit:

\[ h(t) = \lim_{\varepsilon \to 0} \frac{P(t < T \le t + \varepsilon \mid T > t)}{\varepsilon}. \]

That is, given the item is still working after $t$ time units, the probability the item fails within the
next $\varepsilon$ time units is $h(t)\varepsilon + o(\varepsilon)$.

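The failure rate function is determined by the distribution of $T$ as follows:

\[ h(t) = \lim_{\varepsilon \to 0} \frac{P\{t < T \le t + \varepsilon\}}{P\{T > t\}\varepsilon} = \lim_{\varepsilon \to 0} \frac{F_T(t + \varepsilon) - F_T(t)}{(1 - F_T(t))\varepsilon} = \frac{f_T(t)}{1 - F_T(t)}, \qquad (1.15) \]

because the pdf $f_T$ is the derivative of the CDF $F_T$.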

Conversely, a nonnegative function $h = (h(t) : t \ge 0)$ with $\int_0^{\infty} h(t)\,dt = \infty$ determines a
probability distribution with failure rate function $h$ as follows. The CDF is given by

\[ F(t) = 1 - e^{-\int_0^t h(s)\,ds}. \qquad (1.16) \]

It is easy to check that $F$ given by (1.16) has failure rate function $h$. To derive (1.16), and hence
show it gives the unique distribution with failure rate function $h$, start with the fact that we would
like $F'/(1 - F) = h$. Equivalently, $(\ln(1 - F))' = -h$ or $\ln(1 - F) = \ln(1 - F(0)) - \int_0^t h(s)\,ds$, which
is equivalent to (1.16).

Example 1.7.1 (a) Find the failure rate function for an exponentially distributed random variable
with parameter $\lambda$. (b) Find the distribution with the linear failure rate function $h(t) = \frac{t}{\sigma^2}$ for $t \ge 0$.
(c) Find the failure rate function of $T = \min\{T_1, T_2\}$, where $T_1$ and $T_2$ are independent random
variables such that $T_1$ has failure rate function $h_1$ and $T_2$ has failure rate function $h_2$.

Solution: (a) If $T$ has the exponential distribution with parameter $\lambda$, then for $t \ge 0$, $f_T(t) =
\lambda e^{-\lambda t}$ and $1 - F_T(t) = e^{-\lambda t}$, so by (1.15), $h(t) = \lambda$ for all $t \ge 0$. That is, the exponential distribution
with parameter $\lambda$ has constant failure rate $\lambda$. The constant failure rate property is connected with
the memoryless property of the exponential distribution; the memoryless property implies that
$P(t < T \le t + \varepsilon \mid T > t) = P\{T \le \varepsilon\}$, which in view of the definition of $h$ shows that $h$ is constant.

(b) If $h(t) = \frac{t}{\sigma^2}$ for $t \ge 0$, then by (1.16), $F_T(t) = 1 - e^{-\frac{t^2}{2\sigma^2}}$. The corresponding pdf is given by

\[ f_T(t) = \begin{cases} \frac{t}{\sigma^2} e^{-\frac{t^2}{2\sigma^2}} & t \ge 0 \\ 0 & \text{else.} \end{cases} \]

This is the pdf of the Rayleigh distribution with parameter $\sigma^2$.

(c) By the independence and (1.16) applied to $T_1$ and $T_2$,

\[ P\{T > t\} = P\{T_1 > t \text{ and } T_2 > t\} = P\{T_1 > t\} P\{T_2 > t\} = e^{-\int_0^t h_1(s)\,ds} e^{-\int_0^t h_2(s)\,ds} = e^{-\int_0^t h(s)\,ds}, \]

where $h = h_1 + h_2$. Therefore, the failure rate function for the minimum of two independent random
variables is the sum of their failure rate functions. This makes intuitive sense; if there is a system
that fails when either one of two components fails, then the rate of system failure is the sum of
the rates of component failure.
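
Formula (1.16) and parts (b) and (c) of the example can be checked numerically. The sketch below integrates a failure rate with the midpoint rule; the parameter values `SIGMA2` and `LAM` and all function names are hypothetical, chosen only for the check.

```python
import math


def cdf_from_failure_rate(h, t, steps=10_000):
    """F(t) = 1 - exp(-integral_0^t h(s) ds), with the integral by the midpoint rule."""
    ds = t / steps
    integral = sum(h((i + 0.5) * ds) for i in range(steps)) * ds
    return 1.0 - math.exp(-integral)


SIGMA2 = 2.0   # hypothetical value of sigma^2 for the linear failure rate
LAM = 0.5      # hypothetical rate of an exponential lifetime


def h_linear(t):
    """Linear failure rate t / sigma^2 from part (b)."""
    return t / SIGMA2


def h_const(t):
    """Constant failure rate of an exponential lifetime, from part (a)."""
    return LAM


def h_min(t):
    """Failure rate of the minimum of the two independent lifetimes, from part (c)."""
    return h_linear(t) + h_const(t)


def rayleigh_cdf(t):
    """CDF of the Rayleigh distribution with parameter sigma^2."""
    return 1.0 - math.exp(-t * t / (2.0 * SIGMA2))
```

Since the midpoint rule is exact for linear integrands, `cdf_from_failure_rate(h_linear, t)` should match `rayleigh_cdf(t)` up to floating-point error.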



1.8 Jointly distributed random variables 

Let $X_1, X_2, \ldots, X_m$ be random variables on a single probability space $(\Omega, \mathcal{F}, P)$. The joint
cumulative distribution function (CDF) is the function on $\mathbb{R}^m$ defined by

\[ F_{X_1 X_2 \cdots X_m}(x_1, \ldots, x_m) = P\{X_1 \le x_1, X_2 \le x_2, \ldots, X_m \le x_m\}. \]

The CDF determines the probabilities of all events concerning $X_1, \ldots, X_m$. For example, if $R$ is
the rectangular region $(a, b] \times (a', b']$ in the plane, then

\[ P\{(X_1, X_2) \in R\} = F_{X_1 X_2}(b, b') - F_{X_1 X_2}(a, b') - F_{X_1 X_2}(b, a') + F_{X_1 X_2}(a, a'). \]

We write $+\infty$ as an argument of $F_X$ in place of $x_i$ to denote the limit as $x_i \to +\infty$. By the
countable additivity axiom of probability,

\[ F_{X_1 X_2}(x_1, +\infty) = \lim_{x_2 \to \infty} F_{X_1 X_2}(x_1, x_2) = F_{X_1}(x_1). \]

The random variables are jointly continuous if there exists a function $f_{X_1 X_2 \cdots X_m}$, called the
joint probability density function (pdf), such that

\[ F_{X_1 X_2 \cdots X_m}(x_1, \ldots, x_m) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_m} f_{X_1 X_2 \cdots X_m}(u_1, \ldots, u_m)\,du_m \cdots du_1. \]

Note that if $X_1$ and $X_2$ are jointly continuous, then

\[ F_{X_1}(x_1) = F_{X_1 X_2}(x_1, +\infty) = \int_{-\infty}^{x_1} \left[ \int_{-\infty}^{\infty} f_{X_1 X_2}(u_1, u_2)\,du_2 \right] du_1, \]

so that $X_1$ has pdf given by

\[ f_{X_1}(u_1) = \int_{-\infty}^{\infty} f_{X_1 X_2}(u_1, u_2)\,du_2. \]
The pdfs $f_{X_1}$ and $f_{X_2}$ are called the marginal pdfs for the joint pdf $f_{X_1, X_2}$.

If $X_1, X_2, \ldots, X_m$ are each discrete random variables, then they have a joint pmf $p_{X_1 X_2 \cdots X_m}$
defined by

\[ p_{X_1 X_2 \cdots X_m}(u_1, u_2, \ldots, u_m) = P(\{X_1 = u_1\} \cap \{X_2 = u_2\} \cap \cdots \cap \{X_m = u_m\}). \]

The sum of the probability masses is one, and for any subset $A$ of $\mathbb{R}^m$

\[ P\{(X_1, \ldots, X_m) \in A\} = \sum_{(u_1, \ldots, u_m) \in A} p_X(u_1, u_2, \ldots, u_m). \]

The joint pmf of subsets of $X_1, \ldots, X_m$ can be obtained by summing out the other coordinates of
the joint pmf. For example,

\[ p_{X_1}(u_1) = \sum_{u_2} p_{X_1 X_2}(u_1, u_2). \]

The joint characteristic function of $X_1, \ldots, X_m$ is the function on $\mathbb{R}^m$ defined by

\[ \Phi_{X_1 X_2 \cdots X_m}(u_1, u_2, \ldots, u_m) = E\left[e^{j(X_1 u_1 + X_2 u_2 + \cdots + X_m u_m)}\right]. \]

Random variables $X_1, \ldots, X_m$ are defined to be independent if for any Borel subsets $A_1, \ldots, A_m$
of $\mathbb{R}$, the events $\{X_1 \in A_1\}, \ldots, \{X_m \in A_m\}$ are independent. The random variables are
independent if and only if the joint CDF factors:

\[ F_{X_1 X_2 \cdots X_m}(x_1, \ldots, x_m) = F_{X_1}(x_1) \cdots F_{X_m}(x_m). \]

If the random variables are jointly continuous, independence is equivalent to the condition that the
joint pdf factors. If the random variables are discrete, independence is equivalent to the condition
that the joint pmf factors. Similarly, the random variables are independent if and only if the joint
characteristic function factors.
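
For discrete random variables, summing out coordinates and testing whether the joint pmf factors are both mechanical. The sketch below uses a hypothetical joint pmf stored as a dictionary; the data and the helper names are illustrative assumptions.

```python
from fractions import Fraction

# Hypothetical joint pmf on {0, 1} x {0, 1, 2}, stored as {(u1, u2): probability}.
joint = {
    (0, 0): Fraction(1, 12), (0, 1): Fraction(1, 6), (0, 2): Fraction(1, 12),
    (1, 0): Fraction(1, 6),  (1, 1): Fraction(1, 3), (1, 2): Fraction(1, 6),
}

# A second hypothetical joint pmf, concentrated on the diagonal.
diagonal = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}


def marginal(joint_pmf, axis):
    """Sum out every coordinate except the one at position `axis`."""
    out = {}
    for point, p in joint_pmf.items():
        out[point[axis]] = out.get(point[axis], 0) + p
    return out


def is_independent(joint_pmf):
    """Check whether the joint pmf equals the product of its two marginals."""
    p1, p2 = marginal(joint_pmf, 0), marginal(joint_pmf, 1)
    return all(joint_pmf.get((a, b), 0) == p1[a] * p2[b] for a in p1 for b in p2)
```

Here `joint` factors into its marginals, while `diagonal` does not, because knowing the first coordinate determines the second.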

1.9 Conditional densities 

Suppose that $X$ and $Y$ have a joint pdf $f_{XY}$. Recall that the pdf $f_Y$, the second marginal density
of $f_{XY}$, is given by

\[ f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dx. \]

The conditional pdf of $X$ given $Y$, denoted by $f_{X|Y}(x \mid y)$, is undefined if $f_Y(y) = 0$. It is defined
for $y$ such that $f_Y(y) > 0$ by

\[ f_{X|Y}(x \mid y) = \frac{f_{XY}(x, y)}{f_Y(y)}, \qquad -\infty < x < +\infty. \]

If $y$ is fixed and $f_Y(y) > 0$, then as a function of $x$, $f_{X|Y}(x \mid y)$ is itself a pdf.

The expectation of the conditional pdf is called the conditional expectation (or conditional
mean) of $X$ given $Y = y$, written as

\[ E[X \mid Y = y] = \int_{-\infty}^{\infty} x f_{X|Y}(x \mid y)\,dx. \]

If the deterministic function $E[X \mid Y = y]$ is applied to the random variable $Y$, the result is a
random variable denoted by $E[X \mid Y]$.

Note that conditional pdf and conditional expectation were so far defined in case $X$ and $Y$ have
a joint pdf. If instead, $X$ and $Y$ are both discrete random variables, the conditional pmf $p_{X|Y}$ and
the conditional expectation $E[X \mid Y = y]$ can be defined in a similar way. More general notions of
conditional expectation are considered in a later chapter.

1.10 Cross moments of random variables 

Let $X$ and $Y$ be random variables on the same probability space with finite second moments. Three
important related quantities are:

the correlation: $E[XY]$

the covariance: $\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$

the correlation coefficient: $\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}}$

A fundamental inequality is Schwarz's inequality:

\[ |E[XY]| \le \sqrt{E[X^2]E[Y^2]} \qquad (1.17) \]

Furthermore, if $E[Y^2] \ne 0$, equality holds if and only if $P(X = cY) = 1$ for some constant $c$.
Schwarz's inequality (1.17) is equivalent to the $L^2$ triangle inequality for random variables:

\[ E[(X + Y)^2]^{\frac{1}{2}} \le E[X^2]^{\frac{1}{2}} + E[Y^2]^{\frac{1}{2}} \qquad (1.18) \]

Schwarz's inequality can be proved as follows. If $P\{Y = 0\} = 1$ the inequality is trivial, so suppose
$E[Y^2] > 0$. By the inequality $(a + b)^2 \le 2a^2 + 2b^2$ it follows that $E[(X - \lambda Y)^2] < \infty$ for any
constant $\lambda$. Take $\lambda = E[XY]/E[Y^2]$ and note that

\[ 0 \le E[(X - \lambda Y)^2] = E[X^2] - 2\lambda E[XY] + \lambda^2 E[Y^2] = E[X^2] - \frac{E[XY]^2}{E[Y^2]}, \]

which is clearly equivalent to the Schwarz inequality. If $P(X = cY) = 1$ for some $c$ then equality
holds in (1.17), and conversely, if equality holds in (1.17) then $P(X = cY) = 1$ for $c = \lambda$.
Application of Schwarz's inequality to $X - E[X]$ and $Y - E[Y]$ in place of $X$ and $Y$ yields that

\[ |\mathrm{Cov}(X, Y)| \le \sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}. \]

Furthermore, if $\mathrm{Var}(Y) \ne 0$ then equality holds if and only if $X = aY + b$ for some constants $a$ and
$b$. Consequently, if $\mathrm{Var}(X)$ and $\mathrm{Var}(Y)$ are not zero, so that the correlation coefficient $\rho_{XY}$ is well
defined, then $|\rho_{XY}| \le 1$ with equality if and only if $X = aY + b$ for some constants $a$, $b$.
The following alternative expressions for $\mathrm{Cov}(X, Y)$ are often useful in calculations:

\[ \mathrm{Cov}(X, Y) = E[X(Y - E[Y])] = E[(X - E[X])Y] = E[XY] - E[X]E[Y]. \]

In particular, if either $X$ or $Y$ has mean zero then $E[XY] = \mathrm{Cov}(X, Y)$.

Random variables $X$ and $Y$ are called orthogonal if $E[XY] = 0$ and are called uncorrelated
if $\mathrm{Cov}(X, Y) = 0$. If $X$ and $Y$ are independent then they are uncorrelated. The converse is far
from true. Independence requires a large number of equations to be true, namely $F_{XY}(x, y) =
F_X(x)F_Y(y)$ for every real value of $x$ and $y$. The condition of being uncorrelated requires only a
single equation to hold.
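
A small exact computation shows that uncorrelated random variables need not be independent. The example below (X uniform on {-1, 0, 1} and Y = X^2) is a standard illustration that is not taken from the text; the variable names are arbitrary.

```python
from fractions import Fraction

# Hypothetical example: X uniform on {-1, 0, 1} and Y = X^2.
pmf_x = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}


def expect(f):
    """E[f(X)] computed from the pmf of X."""
    return sum(f(x) * p for x, p in pmf_x.items())


# Cov(X, Y) = E[XY] - E[X]E[Y], with XY = X^3 and Y = X^2.
cov_xy = expect(lambda x: x ** 3) - expect(lambda x: x) * expect(lambda x: x ** 2)

# Independence would require P{X = 0, Y = 0} = P{X = 0} P{Y = 0}.
p_joint = pmf_x[0]                # the events {X = 0, Y = 0} and {X = 0} coincide
p_product = pmf_x[0] * pmf_x[0]   # P{X = 0} P{Y = 0}, since {Y = 0} = {X = 0}
```

The covariance is exactly zero, yet the single factorization checked here already fails, so X and Y are dependent.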

Covariance generalizes variance, in that $\mathrm{Var}(X) = \mathrm{Cov}(X, X)$. Covariance is linear in each of
its two arguments:

\[ \mathrm{Cov}(X + Y, U + V) = \mathrm{Cov}(X, U) + \mathrm{Cov}(X, V) + \mathrm{Cov}(Y, U) + \mathrm{Cov}(Y, V) \]
\[ \mathrm{Cov}(aX + b, cY + d) = ac\,\mathrm{Cov}(X, Y) \]

for constants $a$, $b$, $c$, $d$. For example, consider the sum $S_m = X_1 + \cdots + X_m$, such that $X_1, \ldots, X_m$
are (pairwise) uncorrelated with $E[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$ for $1 \le i \le m$. Then $E[S_m] = m\mu$
and

\[ \mathrm{Var}(S_m) = \mathrm{Cov}(S_m, S_m) = \sum_i \mathrm{Var}(X_i) + \sum_{i,j : i \ne j} \mathrm{Cov}(X_i, X_j) = m\sigma^2. \]

Therefore, $\frac{S_m - m\mu}{\sqrt{m\sigma^2}}$ has mean zero and variance one.

1.11 Transformation of random vectors 

A random vector $X$ of dimension $m$ has the form

\[ X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{pmatrix} \]

where $X_1, \ldots, X_m$ are random variables. The joint distribution of $X_1, \ldots, X_m$ can be considered
to be the distribution of the vector $X$. For example, if $X_1, \ldots, X_m$ are jointly continuous, the joint
pdf $f_{X_1 X_2 \cdots X_m}(x_1, \ldots, x_m)$ can as well be written as $f_X(x)$, and be thought of as the pdf of the
random vector $X$.

Let $X$ be a continuous type random vector on $\mathbb{R}^n$. Let $g$ be a one-to-one mapping from $\mathbb{R}^n$
to $\mathbb{R}^n$. Think of $g$ as mapping x-space (here $x$ is lower case, representing a coordinate value) into
y-space. As $x$ varies over $\mathbb{R}^n$, $y$ varies over the range of $g$. All the while, $y = g(x)$ or, equivalently,
$x = g^{-1}(y)$.

Suppose that the Jacobian matrix of derivatives $\frac{\partial y}{\partial x}(x)$ is continuous in $x$ and nonsingular for
all $x$. By the inverse function theorem of vector calculus, it follows that the Jacobian matrix of the
inverse mapping (from $y$ to $x$) exists and satisfies $\frac{\partial x}{\partial y}(y) = \left(\frac{\partial y}{\partial x}(x)\right)^{-1}$. Use $|K|$ for a square matrix
$K$ to denote $|\det(K)|$.

Proposition 1.11.1 Under the above assumptions, $Y = g(X)$ is a continuous type random vector and for
$y$ in the range of $g$:

\[ f_Y(y) = \frac{f_X(x)}{\left|\frac{\partial y}{\partial x}(x)\right|} = f_X(x) \left|\frac{\partial x}{\partial y}(y)\right|. \]


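Example 1.11.2 Let $U$, $V$ have the joint pdf:

\[ f_{UV}(u, v) = \begin{cases} u + v & 0 \leq u, v \leq 1 \\ 0 & \text{else} \end{cases} \]

and let $X = U^2$ and $Y = U(1 + V)$. Let's find the pdf $f_{XY}$. The vector $(U, V)$ in the $u$-$v$ plane is
transformed into the vector $(X, Y)$ in the $x$-$y$ plane under a mapping $g$ that maps $u, v$ to $x = u^2$
and $y = u(1 + v)$. The image in the $x$-$y$ plane of the square $[0, 1]^2$ in the $u$-$v$ plane is the set $A$
given by

\[ A = \{(x, y) : 0 \leq x \leq 1, \text{ and } \sqrt{x} \leq y \leq 2\sqrt{x}\}. \]

See Figure 1.13. The mapping from the square is one to one, for if $(x, y) \in A$ then $(u, v)$ can be
recovered by $u = \sqrt{x}$ and $v = \frac{y}{\sqrt{x}} - 1$. The Jacobian determinant is

\[ \begin{vmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{vmatrix} = \begin{vmatrix} 2u & 0 \\ 1 + v & u \end{vmatrix} = 2u^2. \]

Therefore, using the transformation formula and expressing $u$ and $v$ in terms of $x$ and $y$ yields

\[ f_{XY}(x, y) = \begin{cases} \frac{\sqrt{x} + \left(\frac{y}{\sqrt{x}} - 1\right)}{2x} & \text{if } (x, y) \in A \\ 0 & \text{else} \end{cases} \]

Figure 1.13: Transformation from the $u$-$v$ plane to the $x$-$y$ plane.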
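The transformation formula of Proposition 1.11.1 can be spot-checked numerically for Example 1.11.2. The Python sketch below is only an illustration: the function names and the test points are assumptions, and the Jacobian determinant is approximated by central differences rather than computed symbolically. It verifies at a few interior points that $f_{XY}(g(u, v))$ agrees with $f_{UV}(u, v)$ divided by the absolute Jacobian determinant.

```python
import math


def f_uv(u, v):
    """Joint pdf of (U, V) from Example 1.11.2."""
    return u + v if 0.0 <= u <= 1.0 and 0.0 <= v <= 1.0 else 0.0


def g(u, v):
    """The mapping (u, v) -> (x, y) = (u^2, u(1 + v))."""
    return u * u, u * (1.0 + v)


def f_xy(x, y):
    """Joint pdf of (X, Y) as derived in Example 1.11.2."""
    if not (0.0 < x <= 1.0):
        return 0.0
    root = math.sqrt(x)
    if not (root <= y <= 2.0 * root):
        return 0.0
    return (root + (y / root - 1.0)) / (2.0 * x)


def jacobian_det(u, v, h=1e-6):
    """Central-difference approximation of det(d(x, y)/d(u, v))."""
    dx_du = (g(u + h, v)[0] - g(u - h, v)[0]) / (2 * h)
    dx_dv = (g(u, v + h)[0] - g(u, v - h)[0]) / (2 * h)
    dy_du = (g(u + h, v)[1] - g(u - h, v)[1]) / (2 * h)
    dy_dv = (g(u, v + h)[1] - g(u, v - h)[1]) / (2 * h)
    return dx_du * dy_dv - dx_dv * dy_du


# Arbitrary interior test points (assumed for illustration).
for u, v in [(0.3, 0.7), (0.9, 0.2), (0.5, 0.5)]:
    x, y = g(u, v)
    lhs = f_xy(x, y)
    rhs = f_uv(u, v) / abs(jacobian_det(u, v))
    print(f"(u, v) = ({u}, {v}):  f_XY = {lhs:.6f},  f_UV/|J| = {rhs:.6f}")
```

The numerical determinant should also be close to the closed form $2u^2$ found in the example.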



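Example 1.11.3 Let $U$ and $V$ be independent continuous type random variables. Let $X = U + V$
and $Y = V$. Let us find the joint density of $X, Y$ and the marginal density of $X$. The mapping

\[ g : \begin{pmatrix} u \\ v \end{pmatrix} \to \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} u + v \\ v \end{pmatrix} \]

is invertible, with inverse given by $u = x - y$ and $v = y$. The absolute value of the Jacobian
determinant is given by

\[ \left| \begin{matrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{matrix} \right| = \left| \begin{matrix} 1 & 1 \\ 0 & 1 \end{matrix} \right| = 1. \]

Therefore

\[ f_{XY}(x, y) = f_{UV}(u, v) = f_U(x - y) f_V(y). \]

The marginal density of $X$ is given by

\[ f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dy = \int_{-\infty}^{\infty} f_U(x - y) f_V(y)\,dy. \]

That is, $f_X = f_U * f_V$.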
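The convolution formula can be illustrated numerically. The sketch below assumes, purely for illustration, that $U$ and $V$ are both uniform on $[0, 1]$; the sum then has the well-known triangular density on $[0, 2]$. The integral is approximated with a midpoint rule, and the helper names are hypothetical.

```python
def uniform_pdf(t):
    """Density of the uniform distribution on [0, 1] (an assumed example density)."""
    return 1.0 if 0.0 <= t <= 1.0 else 0.0


def convolve_at(f_u, f_v, x, lo, hi, steps=10000):
    """Midpoint-rule approximation of the integral of f_u(x - y) * f_v(y) dy over [lo, hi]."""
    dy = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        y = lo + (k + 0.5) * dy
        total += f_u(x - y) * f_v(y)
    return total * dy


def triangle_pdf(x):
    """Closed-form density of the sum of two independent Uniform[0, 1] variables."""
    if 0.0 <= x <= 1.0:
        return x
    if 1.0 < x <= 2.0:
        return 2.0 - x
    return 0.0


# f_V vanishes outside [0, 1], so integrating over [0, 1] suffices.
for x in (0.25, 0.5, 1.0, 1.5):
    approx = convolve_at(uniform_pdf, uniform_pdf, x, 0.0, 1.0)
    print(f"x = {x}: numerical f_X = {approx:.4f}, closed form = {triangle_pdf(x):.4f}")
```

For other densities, the integration limits `lo` and `hi` would need to cover the support of $f_V$.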



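Example 1.11.4 Let $X_1$ and $X_2$ be independent $N(0, \sigma^2)$ random variables, and let $X = (X_1, X_2)^T$
denote the two-dimensional random vector with coordinates $X_1$ and $X_2$. Any point $x \in \mathbb{R}^2$ can
be represented in polar coordinates by the vector $(r, \theta)^T$ such that $r = \|x\| = (x_1^2 + x_2^2)^{1/2}$ and
$\theta = \tan^{-1}\left(\frac{x_2}{x_1}\right)$ with values $r \geq 0$ and $0 \leq \theta < 2\pi$. The inverse of this mapping is given by

\[ x_1 = r\cos(\theta) \]
\[ x_2 = r\sin(\theta) \]

We endeavor to find the pdf of the random vector $(R, \Theta)^T$, the polar coordinates of $X$. The pdf of
$X$ is given by

\[ f_X(x) = f_{X_1}(x_1) f_{X_2}(x_2) = \frac{1}{2\pi\sigma^2} e^{-\frac{r^2}{2\sigma^2}}. \]

The range of the mapping is the set $r > 0$ and $0 < \theta \leq 2\pi$. On the range,

\[ \left| \frac{\partial x}{\partial \binom{r}{\theta}} \right| = \left| \begin{matrix} \frac{\partial x_1}{\partial r} & \frac{\partial x_1}{\partial \theta} \\ \frac{\partial x_2}{\partial r} & \frac{\partial x_2}{\partial \theta} \end{matrix} \right| = \left| \begin{matrix} \cos(\theta) & -r\sin(\theta) \\ \sin(\theta) & r\cos(\theta) \end{matrix} \right| = r. \]

Therefore for $(r, \theta)$ in the range of the mapping,

\[ f_{R,\Theta}(r, \theta) = f_X(x) \left| \frac{\partial x}{\partial \binom{r}{\theta}} \right| = \frac{r}{2\pi\sigma^2} e^{-\frac{r^2}{2\sigma^2}}. \]

Of course $f_{R,\Theta}(r, \theta) = 0$ off the range of the mapping. The joint density factors into a function of
$r$ and a function of $\theta$, so $R$ and $\Theta$ are independent. Moreover, $R$ has the Rayleigh density with
parameter $\sigma^2$, and $\Theta$ is uniformly distributed on $[0, 2\pi]$.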
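A small Monte Carlo experiment can make the conclusion of Example 1.11.4 plausible. The sketch below is an illustration under assumed values ($\sigma = 2$, 20000 samples, a fixed seed); it compares the sample means of $R$ and $\Theta$ with the Rayleigh mean $\sigma\sqrt{\pi/2}$ and the uniform mean $\pi$. Agreement of means is only a necessary check, not a proof of the distributions.

```python
import math
import random


def sample_polar(sigma, n, seed=0):
    """Draw n independent pairs (X1, X2) of N(0, sigma^2) variables; return polar coordinates."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x1 = rng.gauss(0.0, sigma)
        x2 = rng.gauss(0.0, sigma)
        r = math.hypot(x1, x2)
        theta = math.atan2(x2, x1) % (2.0 * math.pi)  # map the angle into [0, 2*pi)
        samples.append((r, theta))
    return samples


# Assumed illustrative parameters.
sigma = 2.0
samples = sample_polar(sigma, 20000)
mean_r = sum(r for r, _ in samples) / len(samples)
mean_theta = sum(t for _, t in samples) / len(samples)

print("sample mean of R     :", mean_r, " Rayleigh mean:", sigma * math.sqrt(math.pi / 2))
print("sample mean of Theta :", mean_theta, " uniform mean :", math.pi)
```

A fuller check could compare histograms of $R$ and $\Theta$ with the two factors of the joint density.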



1.12 Problems 



1.1 Simple events 

A register contains 8 random binary digits which are mutually independent. Each digit is a zero or
a one with equal probability. (a) Describe an appropriate probability space $(\Omega, \mathcal{F}, P)$ corresponding
to looking at the contents of the register.

(b) Express each of the following four events explicitly as subsets of $\Omega$, and find their probabilities:
$E_1$ = "No two neighboring digits are the same"
$E_2$ = "Some cyclic shift of the register contents is equal to 01100110"
$E_3$ = "The register contains exactly four zeros"
$E_4$ = "There is a run of at least six consecutive ones"

(c) Find $P(E_1|E_3)$ and $P(E_2|E_3)$.

1.2 Independent vs. mutually exclusive 

(a) Suppose that an event $E$ is independent of itself. Show that either $P(E) = 0$ or $P(E) = 1$.

(b) Events $A$ and $B$ have probabilities $P(A) = 0.3$ and $P(B) = 0.4$. What is $P(A \cup B)$ if $A$ and $B$
are independent? What is $P(A \cup B)$ if $A$ and $B$ are mutually exclusive?

(c) Now suppose that $P(A) = 0.6$ and $P(B) = 0.8$. In this case, could the events $A$ and $B$ be
independent? Could they be mutually exclusive?

1.3 Congestion at output ports

Consider a packet switch with some number of input ports and eight output ports. Suppose four 
packets simultaneously arrive on different input ports, and each is routed toward an output port. 
Assume the choices of output ports are mutually independent, and for each packet, each output 
port has equal probability. 

(a) Specify a probability space $(\Omega, \mathcal{F}, P)$ to describe this situation.

(b) Let $X_i$ denote the number of packets routed to output port $i$ for $1 \leq i \leq 8$. Describe the joint
pmf of $X_1, \ldots, X_8$.

(c) Find $\mathrm{Cov}(X_1, X_2)$.

(d) Find $P\{X_i \leq 1 \text{ for all } i\}$.

(e) Find $P\{X_i \leq 2 \text{ for all } i\}$.

1.4 Frantic search 

At the end of each day Professor Plum puts her glasses in her drawer with probability .90, leaves 
them on the table with probability .06, leaves them in her briefcase with probability 0.03, and she 
actually leaves them at the office with probability 0.01. The next morning she has no recollection 
of where she left the glasses. She looks for them, but each time she looks in a place the glasses are 
actually located, she misses finding them with probability 0.1, whether or not she already looked 
in the same place. (After all, she doesn't have her glasses on and she is in a hurry.) 

(a) Given that Professor Plum didn't find the glasses in her drawer after looking one time, what is 
the conditional probability the glasses are on the table? 

(b) Given that she didn't find the glasses after looking for them in the drawer and on the table 
once each, what is the conditional probability they are in the briefcase? 

(c) Given that she failed to find the glasses after looking in the drawer twice, on the table twice, 
and in the briefcase once, what is the conditional probability she left the glasses at the office? 

1.5 Conditional probability of failed device given failed attempts 

A particular webserver may be working or not working. If the webserver is not working, any attempt 
to access it fails. Even if the webserver is working, an attempt to access it can fail due to network 
congestion beyond the control of the webserver. Suppose that the a priori probability that the server 
is working is 0.8. Suppose that if the server is working, then each access attempt is successful with 
probability 0.9, independently of other access attempts. Find the following quantities. 

(a) P( first access attempt fails) 

(b) P(server is working | first access attempt fails ) 

(c) P(second access attempt fails | first access attempt fails ) 

(d) P(server is working | first and second access attempts fail ). 

1.6 Conditional probabilities—basic computations of iterative decoding 

(a) Suppose $B_1, \ldots, B_n, Y_1, \ldots, Y_n$ are discrete random variables with joint pmf

\[ p(b_1, \ldots, b_n, y_1, \ldots, y_n) = \begin{cases} 2^{-n} \prod_{i=1}^{n} q_i(y_i|b_i) & \text{if } b_i \in \{0, 1\} \text{ for } 1 \leq i \leq n \\ 0 & \text{else} \end{cases} \]

where $q_i(y_i|b_i)$ as a function of $y_i$ is a pmf for $b_i \in \{0, 1\}$. Finally, let $B = B_1 \oplus \cdots \oplus B_n$ represent the
modulo two sum of $B_1, \ldots, B_n$. Thus, the ordinary sum of the $n + 1$ random variables $B_1, \ldots, B_n, B$
is even. Express $P(B = 1|Y_1 = y_1, \ldots, Y_n = y_n)$ in terms of the $y_i$ and the functions $q_i$. Simplify
your answer.

(b) Suppose $B$ and $Z_1, \ldots, Z_k$ are discrete random variables with joint pmf

\[ p(b, z_1, \ldots, z_k) = \begin{cases} \frac{1}{2} \prod_{j=1}^{k} r_j(z_j|b) & \text{if } b \in \{0, 1\} \\ 0 & \text{else} \end{cases} \]

where $r_j(z_j|b)$ as a function of $z_j$ is a pmf for $b \in \{0, 1\}$ fixed. Express $P(B = 1|Z_1 = z_1, \ldots, Z_k = z_k)$
in terms of the $z_j$ and the functions $r_j$.



1.7 Conditional lifetimes and the memoryless property of the geometric distribution 

(a) Let $X$ represent the lifetime, rounded up to an integer number of years, of a certain car battery.
Suppose that the pmf of $X$ is given by $p_X(k) = 0.2$ if $3 \leq k \leq 7$ and $p_X(k) = 0$ otherwise. (i)
Find the probability, $P\{X > 3\}$, that a three year old battery is still working. (ii) Given that the
battery is still working after five years, what is the conditional probability that the battery will
still be working three years later? (i.e. what is $P(X > 8|X > 5)$?)

(b) A certain Illini basketball player shoots the ball repeatedly from half court during practice.
Each shot is a success with probability $p$ and a miss with probability $1 - p$, independently of the
outcomes of previous shots. Let $Y$ denote the number of shots required for the first success. (i)
Express the probability that she needs more than three shots for a success, $P\{Y > 3\}$, in terms of
$p$. (ii) Given that she already missed the first five shots, what is the conditional probability that
she will need more than three additional shots for a success? (i.e. what is $P(Y > 8|Y > 5)$?)

(iii) What type of probability distribution does $Y$ have?

1.8 Blue corners 

Suppose each corner of a cube is colored blue, independently of the other corners, with some 
probability p. Let B denote the event that at least one face of the cube has all four corners colored 
blue, (a) Find the conditional probability of B given that exactly five corners of the cube are 
colored blue, (b) Find P(B), the unconditional probability of B. 

1.9 Distribution of the flow capacity of a network 

A communication network is shown. The link capacities in megabits per second (Mbps) are given
by $C_1 = C_3 = 5$, $C_2 = C_5 = 10$ and $C_4 = 8$, and are the same in each direction. Information
flow from the source to the destination can be split among multiple paths. For example, if all
links are working, then the maximum communication rate is 10 Mbps: 5 Mbps can be routed over
links 1 and 2, and 5 Mbps can be routed over links 3 and 5. Let $F_i$ be the event that link $i$ fails.
Suppose that $F_1, F_2, F_3, F_4$ and $F_5$ are independent and $P(F_i) = 0.2$ for each $i$. Let $X$ be defined
as the maximum rate (in Mbits per second) at which data can be sent from the source node to the
destination node. Find the pmf $p_X$.



1.10 Recognizing cumulative distribution functions 

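Which of the following are valid CDF's? For each that is not valid, state at least one reason why.
For each that is valid, find $P\{X^2 > 5\}$.

[Piecewise definitions of $F_1$, $F_2$, and $F_3$: the formulas are not recoverable. $F_1$ has separate cases for $x$ below and above 0; $F_2$ has cases split at $x = 0$ and $x = 3$; $F_3$ has cases split at $x = 0$ and $x = 10$.]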



1.11 A CDF of mixed type 

Let X have the CDF shown. 



[Figure: graph of the CDF of $X$.]



(a) Find P{X < 0.8}. 

(b) Find $E[X]$.

(c) Find Var(X). 



1.12 CDF and characteristic function of a mixed type random variable 

Let $X = (U - 0.5)_+$, where $U$ is uniformly distributed over the interval $[0, 1]$. That is, $X = U - 0.5$
if $U - 0.5 \geq 0$, and $X = 0$ if $U - 0.5 < 0$.

(a) Find and carefully sketch the CDF $F_X$. In particular, what is $F_X(0)$?

(b) Find the characteristic function $\Phi_X(u)$ for real values of $u$.



1.13 Poisson and geometric random variables with conditioning 

Let $Y$ be a Poisson random variable with mean $\mu > 0$ and let $Z$ be a geometrically distributed
random variable with parameter $p$ with $0 < p < 1$. Assume $Y$ and $Z$ are independent.

(a) Find $P\{Y < Z\}$. Express your answer as a simple function of $\mu$ and $p$.

(b) Find $P(Y < Z|Z = i)$ for $i \geq 1$. (Hint: This is a conditional probability for events.)

(c) Find $P(Y = i|Y < Z)$ for $i \geq 0$. Express your answer as a simple function of $p$, $\mu$ and $i$. (Hint:
This is a conditional probability for events.)

(d) Find $E[Y|Y < Z]$, which is the expected value computed according to the conditional distribution
found in part (c). Express your answer as a simple function of $\mu$ and $p$.




1.14 Conditional expectation for uniform density over a triangular region 

Let (X, Y) be uniformly distributed over the triangle with coordinates (0, 0), (1, 0), and (2, 1). 

(a) What is the value of the joint pdf inside the triangle? 

(b) Find the marginal density of $X$, $f_X(x)$. Be sure to specify your answer for all real values of $x$.

(c) Find the conditional density function $f_{Y|X}(y|x)$. Be sure to specify which values of $x$ the
conditional density is well defined for, and for such $x$ specify the conditional density for all $y$. Also,
for such $x$ briefly describe the conditional density of $y$ in words.

(d) Find the conditional expectation $E[Y|X = x]$. Be sure to specify which values of $x$ this
conditional expectation is well defined for.

1.15 Transformation of a random variable 

Let $X$ be exponentially distributed with mean $\lambda^{-1}$. Find and carefully sketch the distribution
functions for the random variables $Y = \exp(X)$ and $Z = \min(X, 3)$.

1.16 Density of a function of a random variable 

Suppose X is a random variable with probability density function 

\[ f_X(x) = \begin{cases} 2x & 0 \leq x \leq 1 \\ 0 & \text{else} \end{cases} \]

(a) Find $P(X \geq 0.4|X \leq 0.8)$.

(b) Find the density function of $Y$ defined by $Y = -\log(X)$.

1.17 Moments and densities of functions of a random variable 

Suppose the length L and width W of a rectangle are independent and each uniformly distributed 
over the interval [0, 1]. Let C = 2L + 2W (the length of the perimeter) and A = LW (the area). 
Find the means, variances, and probability densities of C and A. 

1.18 Functions of independent exponential random variables 

Let $X_1$ and $X_2$ be independent random variables, with $X_i$ being exponentially distributed with
parameter $\lambda_i$. (a) Find the pdf of $Z = \min\{X_1, X_2\}$. (b) Find the pdf of $R = \frac{X_1}{X_2}$.

1.19 Using the Gaussian Q function 

Express each of the given probabilities in terms of the standard Gaussian complementary CDF Q. 

(a) $P\{X > 16\}$, where $X$ has the $N(10, 9)$ distribution.

(b) $P\{X^2 > 16\}$, where $X$ has the $N(10, 9)$ distribution.

(c) $P\{|X - 2Y| > 1\}$, where $X$ and $Y$ are independent, $N(0, 1)$ random variables. (Hint: Linear
combinations of independent Gaussian random variables are Gaussian.)

1.20 Gaussians and the Q function 

Let X and Y be independent, N(0, 1) random variables. 

(a) Find Cov(3X + 2Y, X + 5Y + 10). 

(b) Express P{X + 4Y > 2} in terms of the Q function. 

(c) Express $P\{(X - Y)^2 > 9\}$ in terms of the $Q$ function.

1.21 Correlation of histogram values

Suppose that $n$ fair dice are independently rolled. Let

\[ X_i = \begin{cases} 1 & \text{if a 1 shows on the } i\text{th roll} \\ 0 & \text{else} \end{cases} \qquad Y_i = \begin{cases} 1 & \text{if a 2 shows on the } i\text{th roll} \\ 0 & \text{else} \end{cases} \]

Let $X$ denote the sum of the $X_i$'s, which is simply the number of 1's rolled. Let $Y$ denote the sum
of the $Y_i$'s, which is simply the number of 2's rolled. Note that if a histogram is made recording
the number of occurrences of each of the six numbers, then $X$ and $Y$ are the heights of the first
two entries in the histogram.

(a) Find $E[X_1]$ and $\mathrm{Var}(X_1)$.

(b) Find $E[X]$ and $\mathrm{Var}(X)$.

(c) Find $\mathrm{Cov}(X_i, Y_j)$ if $1 \leq i, j \leq n$. (Hint: Does it make a difference if $i = j$?)

(d) Find $\mathrm{Cov}(X, Y)$ and the correlation coefficient $\rho(X, Y) = \mathrm{Cov}(X, Y)/\sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}$.

(e) Find $E[Y|X = x]$ for any integer $x$ with $0 \leq x \leq n$. Note that your answer should depend on
$x$ and $n$, but otherwise your answer is deterministic.

1.22 Working with a joint density 

Suppose $X$ and $Y$ have joint density function $f_{X,Y}(x, y) = c(1 + xy)$ if $2 \leq x \leq 3$ and $1 \leq y \leq 2$,
and $f_{X,Y}(x, y) = 0$ otherwise. (a) Find $c$. (b) Find $f_X$ and $f_Y$. (c) Find $f_{X|Y}$.

1.23 A function of jointly distributed random variables 

Suppose (U,V) is uniformly distributed over the square with corners (0,0), (1,0), (1,1), and (0,1), 
and let X = UV. Find the CDF and pdf of X. 

1.24 Density of a difference 

Let $X$ and $Y$ be independent, exponentially distributed random variables with parameter $\lambda$, such
that $\lambda > 0$. Find the pdf of $Z = |X - Y|$.



1.25 Working with a two dimensional density 

Let the random variables X and Y be jointly uniformly distributed over the region shown. 



(a) Determine the value of $f_{X,Y}$ on the region shown.

(b) Find $f_X$, the marginal pdf of $X$.

(c) Find the mean and variance of $X$.

(d) Find the conditional pdf of $Y$ given that $X = x$, for $0 \leq x \leq 1$.

(e) Find the conditional pdf of $Y$ given that $X = x$, for $1 \leq x \leq 2$.

(f) Find and sketch $E[Y|X = x]$ as a function of $x$. Be sure to specify which range of $x$ this
conditional expectation is well defined for.

1.26 Some characteristic functions

Find the mean and variance of random variables with the following characteristic functions: (a)
$\Phi(u) = \exp(-5u^2 + 2ju)$, (b) $\Phi(u) = (e^{ju} - 1)/(ju)$, and (c) $\Phi(u) = \exp(\lambda(e^{ju} - 1))$.

1.27 Uniform density over a union of two square regions 

Let the random variables $X$ and $Y$ be jointly uniformly distributed on the region
$\{0 \leq u \leq 1, 0 \leq v \leq 1\} \cup \{-1 \leq u < 0, -1 \leq v < 0\}$. (a) Determine the value of $f_{XY}$ on the region shown.

(b) Find $f_X$, the marginal pdf of $X$.

(c) Find the conditional pdf of $Y$ given that $X = a$, for $0 \leq a \leq 1$.

(d) Find the conditional pdf of $Y$ given that $X = a$, for $-1 \leq a < 0$.

(e) Find $E[Y|X = a]$ for $|a| \leq 1$.

(f) What is the correlation coefficient of $X$ and $Y$?

(g) Are $X$ and $Y$ independent?

(h) What is the pdf of $Z = X + Y$?

1.28 A transformation of jointly continuous random variables 

Suppose $(U, V)$ has joint pdf

\[ f_{U,V}(u, v) = \begin{cases} 9u^2v^2 & \text{if } 0 \leq u \leq 1 \text{ and } 0 \leq v \leq 1 \\ 0 & \text{else} \end{cases} \]

Let $X = 3U$ and $Y = UV$. (a) Find the joint pdf of $X$ and $Y$, being sure to specify where the joint
pdf is zero.

(b) Using the joint pdf of $X$ and $Y$, find the conditional pdf, $f_{Y|X}(y|x)$, of $Y$ given $X$. (Be sure to
indicate which values of $x$ the conditional pdf is well defined for, and for each such $x$ specify the
conditional pdf for all real values of $y$.)

1.29 Transformation of densities 

Let $U$ and $V$ have the joint pdf:

\[ f_{U,V}(u, v) = \begin{cases} c(u - v)^2 & 0 \leq u, v \leq 1 \\ 0 & \text{else} \end{cases} \]

for some constant $c$. (a) Find the constant $c$. (b) Suppose $X = U^2$ and $Y = U^2V^2$. Describe the
joint pdf $f_{X,Y}(x, y)$ of $X$ and $Y$. Be sure to indicate where the joint pdf is zero.

1.30 Jointly distributed variables 

Let U and V be independent random variables, such that U is uniformly distributed over the 
interval [0, 1], and V has the exponential probability density function 

(a) Calculate ^[^j]. 

(b) Calculate P{U < V}. 

(c) Find the joint probability density function of $Y$ and $Z$, where $Y = U^2$ and $Z = UV$.

1.31 * Why not every set has a length 

Suppose a length (actually, "one-dimensional volume" would be a better name) of any subset $A \subset \mathbb{R}$
could be defined, so that the following axioms are satisfied:

L0: $0 \leq \mathrm{length}(A) \leq \infty$ for any $A \subset \mathbb{R}$

L1: $\mathrm{length}([a, b]) = b - a$ for $a < b$

L2: $\mathrm{length}(A) = \mathrm{length}(A + y)$, for any $A \subset \mathbb{R}$ and $y \in \mathbb{R}$, where $A + y$ represents the translation
of $A$ by $y$, defined by $A + y = \{x + y : x \in A\}$

L3: If $A = \cup_{i=1}^{\infty} B_i$ such that $B_1, B_2, \ldots$ are disjoint, then $\mathrm{length}(A) = \sum_{i=1}^{\infty} \mathrm{length}(B_i)$.

The purpose of this problem is to show that the above supposition leads to a contradiction. Let
$\mathbb{Q}$ denote the set of rational numbers, $\mathbb{Q} = \{p/q : p, q \in \mathbb{Z}, q \neq 0\}$. (a) Show that the set of
rational numbers can be expressed as $\mathbb{Q} = \{q_1, q_2, \ldots\}$, which means that $\mathbb{Q}$ is countably infinite.
Say that $x, y \in \mathbb{R}$ are equivalent, and write $x \sim y$, if $x - y \in \mathbb{Q}$. (b) Show that $\sim$ is an equivalence
relation, meaning it is reflexive ($a \sim a$ for all $a \in \mathbb{R}$), symmetric ($a \sim b$ implies $b \sim a$), and
transitive ($a \sim b$ and $b \sim c$ implies $a \sim c$). For any $x \in \mathbb{R}$, let $Q_x = \mathbb{Q} + x$. (c) Show that for
any $x, y \in \mathbb{R}$, either $Q_x = Q_y$ or $Q_x \cap Q_y = \emptyset$. Sets of the form $Q_x$ are called equivalence classes
of the equivalence relation $\sim$. (d) Show that $Q_x \cap [0, 1] \neq \emptyset$ for all $x \in \mathbb{R}$, or in other words, each
equivalence class contains at least one element from the interval $[0, 1]$. Let $V$ be a set obtained
by choosing exactly one element in $[0, 1]$ from each equivalence class (by accepting that $V$ is well
defined, you'll be accepting what is called the Axiom of Choice). So $V$ is a subset of $[0, 1]$. Suppose
$q'_1, q'_2, \ldots$ is an enumeration of all the rational numbers in the interval $[-1, 1]$, with no number
appearing twice in the list. Let $V_i = V + q'_i$ for $i \geq 1$. (e) Verify that the sets $V_i$ are disjoint, and
$[0, 1] \subset \cup_{i=1}^{\infty} V_i \subset [-1, 2]$. Since the $V_i$'s are translations of $V$, they should all have the same length
as $V$. If the length of $V$ is defined to be zero, then $[0, 1]$ would be covered by a countable union
of disjoint sets of length zero, so $[0, 1]$ would also have length zero. If the length of $V$ were strictly
positive, then the countable union would have infinite length, and hence the interval $[-1, 2]$ would
have infinite length. Either way there is a contradiction.

1.32 * On sigma-algebras, random variables, and measurable functions 

Prove the seven statements lettered (a)-(g) in what follows.

Definition. Let $\Omega$ be an arbitrary set. A nonempty collection $\mathcal{F}$ of subsets of $\Omega$ is defined to be
an algebra if: (i) $A^c \in \mathcal{F}$ whenever $A \in \mathcal{F}$ and (ii) $A \cup B \in \mathcal{F}$ whenever $A, B \in \mathcal{F}$.

(a) If $\mathcal{F}$ is an algebra then $\emptyset \in \mathcal{F}$, $\Omega \in \mathcal{F}$, and the union or intersection of any finite collection of
sets in $\mathcal{F}$ is in $\mathcal{F}$.

Definition. $\mathcal{F}$ is called a $\sigma$-algebra if $\mathcal{F}$ is an algebra such that whenever $A_1, A_2, \ldots$ are each in $\mathcal{F}$,
so is the union, $\cup A_i$.

(b) If $\mathcal{F}$ is a $\sigma$-algebra and $B_1, B_2, \ldots$ are in $\mathcal{F}$, then so is the intersection, $\cap B_i$.

(c) Let $U$ be an arbitrary nonempty set, and suppose that $\mathcal{F}_u$ is a $\sigma$-algebra of subsets of $\Omega$ for
each $u \in U$. Then the intersection $\cap_{u \in U} \mathcal{F}_u$ is also a $\sigma$-algebra.

(d) The collection of all subsets of $\Omega$ is a $\sigma$-algebra.

(e) If $\mathcal{F}$ is any collection of subsets of $\Omega$ then there is a smallest $\sigma$-algebra containing $\mathcal{F}$. (Hint:
use (c) and (d).)

Definitions. $\mathcal{B}(\mathbb{R})$ is the smallest $\sigma$-algebra of subsets of $\mathbb{R}$ which contains all sets of the form
$(-\infty, a]$. Sets in $\mathcal{B}(\mathbb{R})$ are called Borel sets. A real-valued random variable on a probability space
$(\Omega, \mathcal{F}, P)$ is a real-valued function $X$ on $\Omega$ such that $\{\omega : X(\omega) \leq a\} \in \mathcal{F}$ for any $a \in \mathbb{R}$.

(f) If $X$ is a random variable on $(\Omega, \mathcal{F}, P)$ and $A \in \mathcal{B}(\mathbb{R})$ then $\{\omega : X(\omega) \in A\} \in \mathcal{F}$. (Hint: Fix
a random variable $X$. Let $\mathcal{D}$ be the collection of all subsets $A$ of $\mathcal{B}(\mathbb{R})$ for which the conclusion is
true. It is enough (why?) to show that $\mathcal{D}$ contains all sets of the form $(-\infty, a]$ and that $\mathcal{D}$ is a
$\sigma$-algebra of subsets of $\mathbb{R}$. You must use the fact that $\mathcal{F}$ is a $\sigma$-algebra.)

Remark. By (f), $P\{\omega : X(\omega) \in A\}$, or $P\{X \in A\}$ for short, is well defined for $A \in \mathcal{B}(\mathbb{R})$.

Definition. A function $g$ mapping $\mathbb{R}$ to $\mathbb{R}$ is called Borel measurable if $\{x : g(x) \in A\} \in \mathcal{B}(\mathbb{R})$
whenever $A \in \mathcal{B}(\mathbb{R})$.

(g) If $X$ is a real-valued random variable on $(\Omega, \mathcal{F}, P)$ and $g$ is a Borel measurable function, then
$Y$ defined by $Y = g(X)$ is also a random variable on $(\Omega, \mathcal{F}, P)$.



Chapter 2 

Convergence of a Sequence of 
Random Variables 



Convergence to limits is a central concept in the theory of calculus. Limits are used to define 
derivatives and integrals. We wish to consider derivatives and integrals of random functions, so it 
is natural to begin by examining what it means for a sequence of random variables to converge. 
See the Appendix for a review of the definition of convergence for a sequence of numbers. 

2.1 Four definitions of convergence of random variables 

Recall that a random variable $X$ is a function on $\Omega$ for some probability space $(\Omega, \mathcal{F}, P)$. A sequence
of random variables $(X_n(\omega) : n \geq 1)$ is hence a sequence of functions. There are many possible
definitions for convergence of a sequence of random variables.

One idea is to require $X_n(\omega)$ to converge for each fixed $\omega$. However, at least intuitively, what
happens on an event of probability zero is not important. Thus, we use the following definition.

Definition 2.1.1 A sequence of random variables $(X_n : n \geq 1)$ converges almost surely to a
random variable $X$, if all the random variables are defined on the same probability space, and
$P\{\lim_{n \to \infty} X_n = X\} = 1$. Almost sure convergence is denoted by $\lim_{n \to \infty} X_n = X$ a.s. or $X_n \xrightarrow{a.s.} X$.

Conceptually, to check almost sure convergence, one can first find the set $\{\omega : \lim_{n \to \infty} X_n(\omega) =
X(\omega)\}$ and then see if it has probability one.

We shall construct some examples using the standard unit-interval probability space defined
in Example 1.1.2. This particular choice of $(\Omega, \mathcal{F}, P)$ is useful for generating examples, because
random variables, being functions on $\Omega$, can be simply specified by their graphs. For example,
consider the random variable $X$ pictured in Figure 2.1. The probability mass function for such $X$
is given by $P\{X = 1\} = P\{X = 2\} = \frac{1}{4}$ and $P\{X = 3\} = \frac{1}{2}$. Figure 2.1 is a bit sloppy, in that it
is not clear what the values of $X$ are at the jump points, $\omega = 1/4$ or $\omega = 1/2$. However, each of
these points has probability zero, so the distribution of $X$ is the same no matter how $X$ is defined
at those points.

Figure 2.1: A random variable on $(\Omega, \mathcal{F}, P)$.



Example 2.1.2 Let $(X_n : n \geq 1)$ be the sequence of random variables on the standard unit-interval
probability space defined by $X_n(\omega) = \omega^n$, illustrated in Figure 2.2. This sequence converges for all
$\omega \in \Omega$, with the limit

\[ \lim_{n \to \infty} X_n(\omega) = \begin{cases} 0 & \text{if } 0 \leq \omega < 1 \\ 1 & \text{if } \omega = 1. \end{cases} \]

Figure 2.2: $X_n(\omega) = \omega^n$ on the standard unit-interval probability space.

The single point set $\{1\}$ has probability zero, so it is also true (and simpler to say) that $(X_n : n \geq 1)$
converges a.s. to zero. In other words, if we let $X$ be the zero random variable, defined by $X(\omega) = 0$
for all $\omega$, then $X_n \xrightarrow{a.s.} X$.
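The pointwise behavior in Example 2.1.2 is easy to tabulate. The following Python sketch evaluates $X_n(\omega) = \omega^n$ at a few sample points and also computes $P\{X_n \geq \epsilon\} = 1 - \epsilon^{1/n}$, which follows because $\omega^n \geq \epsilon$ exactly when $\omega \geq \epsilon^{1/n}$ under the uniform measure on $[0, 1]$. The sample points and the value $\epsilon = 0.5$ are assumptions chosen for illustration.

```python
def x_n(omega, n):
    """X_n(omega) = omega**n on the standard unit-interval probability space."""
    return omega ** n


def prob_at_least(eps, n):
    """P{X_n >= eps} = P{omega >= eps**(1/n)} = 1 - eps**(1/n), for 0 < eps <= 1."""
    return 1.0 - eps ** (1.0 / n)


# Illustrative sample points; omega = 1.0 is the single point where the limit is 1.
for omega in (0.5, 0.9, 0.99, 1.0):
    values = [x_n(omega, n) for n in (1, 10, 100, 1000)]
    print(f"omega = {omega}: X_n(omega) for n = 1, 10, 100, 1000 ->", values)

for n in (1, 10, 100, 1000):
    print(f"n = {n}: P{{X_n >= 0.5}} = {prob_at_least(0.5, n):.6f}")
```

The second table shows the probability of a large value shrinking to zero, which is consistent with the a.s. convergence to zero stated in the example.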



Example 2.1.3 (Moving, shrinking rectangles) Let $(X_n : n \geq 1)$ be the sequence of random
variables on the standard unit-interval probability space, as shown in Figure 2.3. The variable
$X_1$ is identically one. The variables $X_2$ and $X_3$ are one on intervals of length $\frac{1}{2}$. The variables
$X_4, X_5, X_6$, and $X_7$ are one on intervals of length $\frac{1}{4}$. In general, each $n \geq 1$ can be written as
$n = 2^k + j$ where $k = \lfloor \log_2 n \rfloor$ and $0 \leq j < 2^k$. The variable $X_n$ is one on the length $2^{-k}$ interval
$(j2^{-k}, (j + 1)2^{-k}]$.

Figure 2.3: A sequence of random variables on $(\Omega, \mathcal{F}, P)$.

To investigate a.s. convergence, fix an arbitrary value for $\omega$. Then for each $k \geq 1$, there
is one value of $n$ with $2^k \leq n < 2^{k+1}$ such that $X_n(\omega) = 1$, and $X_n(\omega) = 0$ for all other $n$.
Therefore, $\lim_{n \to \infty} X_n(\omega)$ does not exist. That is, $\{\omega : \lim_{n \to \infty} X_n \text{ exists}\} = \emptyset$, so of course,
$P\{\lim_{n \to \infty} X_n \text{ exists}\} = 0$. Thus, $X_n$ does not converge in the a.s. sense.

However, for large $n$, $P\{X_n = 0\}$ is close to one. This suggests that $X_n$ converges to the zero
random variable in some weaker sense.
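The indexing $n = 2^k + j$ of Example 2.1.3 can be made concrete with a short Python sketch. The helper names are hypothetical, and the point $\omega = 1/3$ is an arbitrary choice; exact rational arithmetic avoids rounding at the interval endpoints. For a fixed $\omega$ in $(0, 1]$, the sketch lists, for each block $2^k \leq n < 2^{k+1}$, the indices $n$ with $X_n(\omega) = 1$.

```python
from fractions import Fraction


def x_n(omega, n):
    """X_n(omega): 1 on the interval (j 2^-k, (j+1) 2^-k], else 0, where n = 2^k + j."""
    k = n.bit_length() - 1           # k = floor(log2 n)
    j = n - (1 << k)                 # 0 <= j < 2^k
    width = Fraction(1, 1 << k)      # interval length 2^-k
    return 1 if j * width < omega <= (j + 1) * width else 0


def ones_in_block(omega, k):
    """Indices n with 2^k <= n < 2^(k+1) for which X_n(omega) = 1."""
    return [n for n in range(1 << k, 1 << (k + 1)) if x_n(omega, n) == 1]


def prob_one(n):
    """P{X_n = 1}, the length of the interval on which X_n equals one."""
    return Fraction(1, 1 << (n.bit_length() - 1))


omega = Fraction(1, 3)               # an arbitrary point of (0, 1]
for k in range(6):
    print(f"k = {k}: X_n(omega) = 1 for n in {ones_in_block(omega, k)}")
print("P{X_1000 = 1} =", prob_one(1000))
```

Each block contributes exactly one index with value one, so the sequence $X_n(\omega)$ keeps returning to 1 while $P\{X_n = 1\}$ tends to zero.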

Example 2.1.3 motivates us to consider the following weaker notion of convergence of a sequence 
of random variables. 

Definition 2.1.4 A sequence of random variables $(X_n)$ converges to a random variable $X$ in probability
if all the random variables are defined on the same probability space, and for any $\epsilon > 0$,
$\lim_{n \to \infty} P\{|X - X_n| \geq \epsilon\} = 0$. Convergence in probability is denoted by $\lim_{n \to \infty} X_n = X$ p., or
$X_n \xrightarrow{p.} X$.



Convergence in probability requires that $|X - X_n|$ be small with high probability (to be precise,
less than or equal to $\epsilon$ with probability that converges to one as $n \to \infty$), but on the small
probability event that $|X - X_n|$ is not small, it can be arbitrarily large. For some applications that
is unacceptable. Roughly speaking, the next definition of convergence requires that $|X - X_n|$ be
small with high probability for large $n$, and even if it is not small, the average squared value has
to be small enough.



Definition 2.1.5 A sequence of random variables $(X_n)$ converges to a random variable $X$ in the
mean square sense if all the random variables are defined on the same probability space, $E[X_n^2] <
+\infty$ for all $n$, and $\lim_{n \to \infty} E[(X_n - X)^2] = 0$. Mean square convergence is denoted by
$\lim_{n \to \infty} X_n = X$ m.s. or $X_n \xrightarrow{m.s.} X$.

Although it isn't explicitly stated in the definition of m.s. convergence, the limit random variable
must also have a finite second moment:

Proposition 2.1.6 If $X_n \xrightarrow{m.s.} X$, then $E[X^2] < +\infty$.

Proof. Suppose that $X_n \xrightarrow{m.s.} X$. By definition, $E[X_n^2] < \infty$ for all $n$. Also by definition, there
exists some $n_0$ so that $E[(X - X_n)^2] < 1$ for all $n \geq n_0$. The $L^2$ triangle inequality for random
variables, (1.18), yields $E[X^2]^{\frac{1}{2}} \leq E[(X - X_{n_0})^2]^{\frac{1}{2}} + E[X_{n_0}^2]^{\frac{1}{2}} < +\infty$. ■



Example 2.1.7 (More moving, shrinking rectangles) This example is along the same lines as
Example 2.1.3, using the standard unit-interval probability space. Each random variable of the
sequence $(X_n : n \geq 1)$ is defined as indicated in Figure 2.4, where the value $a_n > 0$ is some
constant depending on $n$. The graph of $X_n$ for $n \geq 1$ has height $a_n$ over some subinterval of $\Omega$ of
length $\frac{1}{n}$. We don't explicitly identify the location of the interval, but we require that for any fixed
$\omega$, $X_n(\omega) = a_n$ for infinitely many values of $n$, and $X_n(\omega) = 0$ for infinitely many values of $n$. Such
a choice of the locations of the intervals is possible because the sum of the lengths of the intervals,
$\sum_{n=1}^{\infty} \frac{1}{n}$, is infinite.

Figure 2.4: A sequence of random variables corresponding to moving, shrinking rectangles.

Of course $X_n \xrightarrow{a.s.} 0$ if the deterministic sequence $(a_n)$ converges to zero. However, if there
is a constant $\epsilon > 0$ such that $a_n \geq \epsilon$ for all $n$ (for example if $a_n = 1$ for all $n$), then $\{\omega :
\lim_{n \to \infty} X_n(\omega) \text{ exists}\} = \emptyset$, just as in Example 2.1.3. The sequence converges to zero in probability
for any choice of the constants $(a_n)$, because for any $\epsilon > 0$,

\[ P\{|X_n - 0| \geq \epsilon\} \leq P\{X_n \neq 0\} = \frac{1}{n} \to 0. \]

Finally, to investigate mean square convergence, note that $E[|X_n - 0|^2] = \frac{a_n^2}{n}$. Hence, $X_n \xrightarrow{m.s.} 0$ if
and only if the sequence of constants $(a_n)$ is such that $\lim_{n \to \infty} \frac{a_n^2}{n} = 0$. For example, if $a_n = \ln(n)$
for all $n$, then $X_n \xrightarrow{m.s.} 0$, but if $a_n = \sqrt{n}$, then $(X_n)$ does not converge to zero in the m.s. sense.
(Proposition 2.1.13 below shows that a sequence can have only one limit in the a.s., p., or m.s.
senses, so the fact $X_n \xrightarrow{p.} 0$ implies that zero is the only possible limit in the m.s. sense. So if
$\frac{a_n^2}{n} \not\to 0$, then $(X_n)$ doesn't converge to any random variable in the m.s. sense.)
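The two criteria in Example 2.1.7 can be compared side by side. The Python sketch below tabulates $P\{X_n \neq 0\} = 1/n$ and $E[X_n^2] = a_n^2/n$ for the two choices of $(a_n)$ mentioned in the example, $a_n = \ln(n)$ and $a_n = \sqrt{n}$. The function names and the sampled values of $n$ are assumptions made for illustration.

```python
import math


def prob_nonzero(n):
    """P{X_n != 0}: the length 1/n of the interval on which X_n equals a_n > 0."""
    return 1.0 / n


def second_moment(a_n, n):
    """E[X_n^2] = a_n^2 / n when X_n equals a_n on an interval of length 1/n and 0 elsewhere."""
    return a_n ** 2 / n


for n in (10, 10**3, 10**6):
    log_case = second_moment(math.log(n), n)     # a_n = ln(n): tends to zero
    sqrt_case = second_moment(math.sqrt(n), n)   # a_n = sqrt(n): stays at one
    print(f"n = {n}: P{{X_n != 0}} = {prob_nonzero(n):.2e}, "
          f"E[X_n^2] with ln(n) = {log_case:.2e}, with sqrt(n) = {sqrt_case:.2f}")
```

In both cases the probability of a nonzero value vanishes, but only the logarithmic heights give a vanishing second moment.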
Example 2.1.8 (Anchored, shrinking rectangles) Let $(X_n : n \geq 1)$ be a sequence of random
variables defined on the standard unit-interval probability space, as indicated in Figure 2.5, where
the value $a_n > 0$ is some constant depending on $n$. That is, $X_n(\omega)$ is equal to $a_n$ if $0 \leq \omega \leq 1/n$,
and to zero otherwise. For any nonzero $\omega$ in $\Omega$, $X_n(\omega) = 0$ for all $n$ such that $n > 1/\omega$. Therefore,
$X_n \xrightarrow{a.s.} 0$.

Figure 2.5: A sequence of random variables corresponding to anchored, shrinking rectangles.

Whether the sequence $(X_n)$ converges in p. or m.s. sense for this example is exactly the same
as in Example 2.1.7. That is, for convergence in probability or mean square sense, the locations of
the shrinking intervals of support don't matter. So $X_n \xrightarrow{p.} 0$. And $X_n \xrightarrow{m.s.} 0$ if and only if $\frac{a_n^2}{n} \to 0$.

It is shown in Proposition 2.1.13 below that either a.s. or m.s. convergence imply convergence in
probability. Example 2.1.8 shows that a.s. convergence, like convergence in probability, can allow
$|X_n(\omega) - X(\omega)|$ to be extremely large for $\omega$ in a small probability set. So neither convergence in
probability, nor a.s. convergence, imply m.s. convergence, unless an additional assumption is made
to control the difference $|X_n(\omega) - X(\omega)|$ everywhere on $\Omega$.
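The contrast between a.s. and m.s. convergence in Example 2.1.8 can be seen in a small computation. The Python sketch below assumes the heights $a_n = n$, chosen here only as an illustration of a sequence with $a_n^2/n \not\to 0$. For a fixed $\omega > 0$ it finds the last index at which $X_n(\omega)$ is nonzero, and it evaluates $E[X_n^2] = a_n^2/n$ exactly.

```python
from fractions import Fraction


def x_n(omega, n, a):
    """X_n(omega) = a(n) if 0 <= omega <= 1/n, and 0 otherwise."""
    return a(n) if 0 <= omega <= Fraction(1, n) else 0


def last_nonzero_index(omega, a, n_max):
    """Largest n <= n_max with X_n(omega) != 0 (0 if there is none)."""
    hits = [n for n in range(1, n_max + 1) if x_n(omega, n, a) != 0]
    return max(hits) if hits else 0


def second_moment(n, a):
    """E[X_n^2] = a(n)^2 * (1/n), since X_n = a(n) on an interval of length 1/n."""
    return a(n) ** 2 * Fraction(1, n)


def heights(n):
    """Assumed example heights a_n = n."""
    return n


omega = Fraction(3, 100)   # an arbitrary nonzero point of the unit interval
print("last n with X_n(omega) != 0:", last_nonzero_index(omega, heights, 1000))
print("E[X_n^2] for n = 10, 100, 1000:", [second_moment(n, heights) for n in (10, 100, 1000)])
```

For this choice of heights, every fixed $\omega > 0$ sees zeros from some index on, while the second moments grow without bound.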

Example 2.1.9 (Rearrangements of rectangles) Let $(X_n : n \geq 1)$ be a sequence of random variables
defined on the standard unit-interval probability space. The first three random variables
in the sequence are indicated in Figure 2.6. Suppose that the sequence is periodic, with period
three, so that $X_{n+3} = X_n$ for all $n \geq 1$. Intuitively speaking, the sequence of random variables
persistently jumps around. Obviously it does not converge in the a.s. sense. The sequence does
not settle down to converge, even in the sense of convergence in probability, to any one random
variable. This can be proved as follows. Suppose for the sake of contradiction that $X_n \xrightarrow{p.} X$ for
some random variable $X$. Then for any $\epsilon > 0$ and $\delta > 0$, if $n$ is sufficiently large, $P\{|X_n - X| \geq \epsilon\} \leq \delta$.
But because the sequence is periodic, it must be that $P\{|X_n - X| \geq \epsilon\} \leq \delta$ for $1 \leq n \leq 3$. Since $\delta$
is arbitrary it must be that $P\{|X_n - X| \geq \epsilon\} = 0$ for $1 \leq n \leq 3$. Since $\epsilon$ is arbitrary it must be that
$P\{X = X_n\} = 1$ for $1 \leq n \leq 3$. Hence, $P\{X_1 = X_2 = X_3\} = 1$, which is a contradiction. Thus,
the sequence does not converge in probability. A similar argument shows it does not converge in
the m.s. sense, either.

Figure 2.6: A sequence of random variables obtained by rearrangement of rectangles (panels: $X_1(\omega)$, $X_2(\omega)$, $X_3(\omega)$).

Even though the sequence fails to converge in the a.s., m.s., or p. senses, it can be observed that
all of the $X_n$'s have the same probability distribution. The variables differ only in that the
places where they take their possible values are rearranged.

Example 2.1.9 suggests that it would be useful to have a notion of convergence that just depends
on the distributions of the random variables. One idea for a definition of convergence in distribution
is to require that the sequence of CDFs $F_{X_n}(x)$ converge as $n \to \infty$ for all $x$. The following example
shows such a definition could give unexpected results in some cases.

Example 2.1.10 Let $U$ be uniformly distributed on the interval $[0, 1]$, and for $n \geq 1$, let $X_n = \frac{(-1)^n U}{n}$.
Let $X$ denote the random variable such that $X = 0$ for all $\omega$. It is easy to verify that
$X_n \xrightarrow{a.s.} X$ and $X_n \xrightarrow{p.} X$. Does the CDF of $X_n$ converge to the CDF of $X$? The CDF of $X_n$ is
graphed in Figure 2.7. The CDF $F_{X_n}(x)$ converges to 0 for $x < 0$ and to one for $x > 0$. However,
$F_{X_n}(0)$ alternates between 0 and 1 and hence does not converge to anything. In particular, it
doesn't converge to $F_X(0)$. Thus, $F_{X_n}(x)$ converges to $F_X(x)$ for all $x$ except $x = 0$.

Figure 2.7: CDF of $X_n = \frac{(-1)^n U}{n}$ (panels: $F_{X_n}$ for $n$ even, $F_{X_n}$ for $n$ odd).
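The behavior of the CDFs in Example 2.1.10 can be tabulated directly. The following sketch writes out $F_{X_n}$ in closed form, using the fact that $X_n$ is uniform on $[0, 1/n]$ for $n$ even and on $[-1/n, 0]$ for $n$ odd; the function names are illustrative and not part of the notes.

```python
def cdf_xn(n, x):
    """CDF of X_n = (-1)^n U / n, with U uniform on [0, 1]."""
    if n % 2 == 0:               # X_n = U/n is uniform on [0, 1/n]
        t = n * x
    else:                        # X_n = -U/n is uniform on [-1/n, 0]
        t = 1 + n * x
    return min(1.0, max(0.0, t))

def cdf_limit(x):
    """CDF of the limit X = 0: a unit jump at zero."""
    return 1.0 if x >= 0 else 0.0

# At x = 0 the values alternate and never settle.
at_zero = [cdf_xn(n, 0.0) for n in range(1, 7)]
# Away from x = 0 the values agree with the limit CDF for large n.
away = {x: (cdf_xn(10**6, x), cdf_xn(10**6 + 1, x)) for x in (-0.01, 0.01)}
print(at_zero, away)
```

The only point where convergence fails, $x = 0$, is exactly the point where the limit CDF has a jump.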

Recall that the distribution of a random variable $X$ has probability mass $\Delta$ at some value $x_0$,
i.e. $P\{X = x_0\} = \Delta > 0$, if and only if the CDF has a jump of size $\Delta$ at $x_0$: $F(x_0) - F(x_0-) = \Delta$.
Example 2.1.10 illustrates the fact that if the limit random variable $X$ has such a point mass, then
even if $X_n$ is very close to $X$, the value $F_{X_n}(x)$ need not converge. To overcome this phenomenon,
we adopt a definition of convergence in distribution which requires convergence of the CDFs only
at the continuity points of the limit CDF. Continuity points are defined for general functions in
Appendix 11.3. Since CDFs are right-continuous and nondecreasing, a point $x$ is a continuity point
of a CDF $F_X$ if and only if there is no jump of $F_X$ at $x$: i.e. if $F_X(x) = F_X(x-)$.




Definition 2.1.11 A sequence $(X_n : n \geq 1)$ of random variables converges in distribution to a
random variable $X$ if

$$\lim_{n \to \infty} F_{X_n}(x) = F_X(x) \quad \text{at all continuity points } x \text{ of } F_X.$$

Convergence in distribution is denoted by $\lim_{n \to \infty} X_n = X$ d. or $X_n \xrightarrow{d.} X$.

One way to investigate convergence in distribution is through the use of characteristic functions. 

Proposition 2.1.12 Let $(X_n)$ be a sequence of random variables and let $X$ be a random variable.
Then the following are equivalent:

(i) $X_n \xrightarrow{d.} X$

(ii) $E[f(X_n)] \to E[f(X)]$ for any bounded continuous function $f$.

(iii) $\Phi_{X_n}(u) \to \Phi_X(u)$ for each $u \in \mathbb{R}$ (i.e. pointwise convergence of characteristic functions)

The relationships among the four types of convergence discussed in this section are given in
the following proposition, and are pictured in Figure 2.8. The definitions use differing amounts of
information about the random variables $(X_n : n \geq 1)$ and $X$ involved. Convergence in the a.s. sense
involves joint properties of all the random variables. Convergence in the p. or m.s. sense involves
only pairwise joint distributions, namely those of $(X_n, X)$ for all $n$. Convergence in distribution
involves only the individual distributions of the random variables.
Convergence in the a.s., m.s., and p. senses requires the variables to all be defined on the same
probability space. For convergence in distribution, the random variables need not be defined on
the same probability space.







Figure 2.8: Relationships among four types of convergence of random variables. 

Proposition 2.1.13 (a) If $X_n \xrightarrow{a.s.} X$ then $X_n \xrightarrow{p.} X$.

(b) If $X_n \xrightarrow{m.s.} X$ then $X_n \xrightarrow{p.} X$.

(c) If $P\{|X_n| \leq Y\} = 1$ for all $n$ for some fixed random variable $Y$ with $E[Y^2] < \infty$, and if
$X_n \xrightarrow{p.} X$, then $X_n \xrightarrow{m.s.} X$.

(d) If $X_n \xrightarrow{p.} X$ then $X_n \xrightarrow{d.} X$.

(e) Suppose $X_n \to X$ in the p., m.s., or a.s. sense and $X_n \to Y$ in the p., m.s., or a.s. sense.
Then $P\{X = Y\} = 1$. That is, if differences on sets of probability zero are ignored, a sequence of
random variables can have only one limit (if p., m.s., and/or a.s. senses are used).

(f) Suppose $X_n \xrightarrow{d.} X$ and $X_n \xrightarrow{d.} Y$. Then $X$ and $Y$ have the same distribution.
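Proof. (a) Suppose $X_n \xrightarrow{a.s.} X$ and let $\epsilon > 0$. Define a sequence of events $A_n$ by

$$A_n = \{\omega : |X_n(\omega) - X(\omega)| < \epsilon\}$$

We only need to show that $P(A_n) \to 1$. Define $B_n$ by

$$B_n = \{\omega : |X_k(\omega) - X(\omega)| < \epsilon \text{ for all } k \geq n\}$$

Note that $B_n \subset A_n$ and $B_1 \subset B_2 \subset \cdots$ so $\lim_{n \to \infty} P(B_n) = P(B)$ where $B = \bigcup_{n=1}^{\infty} B_n$. Clearly

$$B \supset \{\omega : \lim_{n \to \infty} X_n(\omega) = X(\omega)\}$$

so $1 = P(B) = \lim_{n \to \infty} P(B_n)$. Since $P(A_n)$ is squeezed between $P(B_n)$ and 1, $\lim_{n \to \infty} P(A_n) = 1$,
so $X_n \xrightarrow{p.} X$.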

(b) Suppose $X_n \xrightarrow{m.s.} X$ and let $\epsilon > 0$. By the Markov inequality applied to $|X - X_n|^2$,

$$P\{|X - X_n| \geq \epsilon\} \leq \frac{E[|X - X_n|^2]}{\epsilon^2} \qquad (2.1)$$

The right side of (2.1), and hence the left side of (2.1), converges to zero as $n$ goes to infinity.
Therefore $X_n \xrightarrow{p.} X$ as $n \to \infty$.

(c) Suppose $X_n \xrightarrow{p.} X$. Then for any $\epsilon > 0$,

$$P\{|X| \geq Y + \epsilon\} \leq P\{|X - X_n| \geq \epsilon\} \to 0$$

so that $P\{|X| \geq Y + \epsilon\} = 0$ for every $\epsilon > 0$. Thus, $P\{|X| \leq Y\} = 1$, so that
$P\{|X - X_n|^2 \leq 4Y^2\} = 1$. Therefore, with probability one, for any $\epsilon > 0$,

$$|X - X_n|^2 \leq 4Y^2 I_{\{|X - X_n| \geq \epsilon\}} + \epsilon^2$$

so

$$E[|X - X_n|^2] \leq 4E[Y^2 I_{\{|X - X_n| \geq \epsilon\}}] + \epsilon^2$$

In the special case that $P\{Y = L\} = 1$ for a constant $L$, the term $E[Y^2 I_{\{|X - X_n| \geq \epsilon\}}]$ is equal to
$L^2 P\{|X - X_n| \geq \epsilon\}$, and by the hypotheses, $P\{|X - X_n| \geq \epsilon\} \to 0$. Even if $Y$ is random, since
$E[Y^2] < \infty$ and $P\{|X - X_n| \geq \epsilon\} \to 0$, it still follows that $E[Y^2 I_{\{|X - X_n| \geq \epsilon\}}] \to 0$ as $n \to \infty$, by
Corollary 11.6.5. So, for $n$ large enough, $E[|X - X_n|^2] \leq 2\epsilon^2$. Since $\epsilon$ was arbitrary, $X_n \xrightarrow{m.s.} X$.

(d) Assume $X_n \xrightarrow{p.} X$. Select any continuity point $x$ of $F_X$. It must be proved that $\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$.
Let $\epsilon > 0$. Then there exists $\delta > 0$ so that $F_X(x) \leq F_X(x - \delta) + \frac{\epsilon}{2}$. (See Figure 2.9.) Now

$$\begin{aligned} \{X \leq x - \delta\} &= \{X \leq x - \delta, X_n \leq x\} \cup \{X \leq x - \delta, X_n > x\} \\ &\subset \{X_n \leq x\} \cup \{|X - X_n| \geq \delta\} \end{aligned}$$

so

$$F_X(x - \delta) \leq F_{X_n}(x) + P\{|X_n - X| \geq \delta\}.$$

For all $n$ sufficiently large, $P\{|X_n - X| \geq \delta\} \leq \frac{\epsilon}{2}$. This and the choice of $\delta$ yield, for all $n$ sufficiently
large, $F_X(x) \leq F_{X_n}(x) + \epsilon$. Similarly, for all $n$ sufficiently large, $F_X(x) \geq F_{X_n}(x) - \epsilon$. So for all $n$
sufficiently large, $|F_{X_n}(x) - F_X(x)| \leq \epsilon$. Since $\epsilon$ was arbitrary, $\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$.

Figure 2.9: A CDF at a continuity point (the points $x - \delta$ and $x$ are marked on the horizontal axis).

(e) By parts (a) and (b), already proved, we can assume that $X_n \xrightarrow{p.} X$ and $X_n \xrightarrow{p.} Y$. Let $\epsilon > 0$
and $\delta > 0$, and select $N$ so large that $P\{|X_n - X| \geq \epsilon\} \leq \delta$ and $P\{|X_n - Y| \geq \epsilon\} \leq \delta$ for all $n \geq N$.
By the triangle inequality, $|X - Y| \leq |X_N - X| + |X_N - Y|$. Thus,
$\{|X - Y| \geq 2\epsilon\} \subset \{|X_N - X| \geq \epsilon\} \cup \{|X_N - Y| \geq \epsilon\}$ so that
$P\{|X - Y| \geq 2\epsilon\} \leq P\{|X_N - X| \geq \epsilon\} + P\{|X_N - Y| \geq \epsilon\} \leq 2\delta$. We've proved that
$P\{|X - Y| \geq 2\epsilon\} \leq 2\delta$. Since $\delta$ was arbitrary, it must be that $P\{|X - Y| \geq 2\epsilon\} = 0$. Since $\epsilon$ was
arbitrary, it must be that $P\{|X - Y| = 0\} = 1$.

(f) Suppose $X_n \xrightarrow{d.} X$ and $X_n \xrightarrow{d.} Y$. Then $F_X(x) = F_Y(x)$ whenever $x$ is a continuity point of
both $F_X$ and $F_Y$. Since $F_X$ and $F_Y$ are nondecreasing and bounded, they can have only finitely many
discontinuities of size greater than $1/n$ for any $n$, so that the total number of discontinuities is at
most countably infinite. Hence, in any nonempty interval, there is a point of continuity of both
functions. So for any $x \in \mathbb{R}$, there is a strictly decreasing sequence of numbers $(x_n)$ converging to $x$,
such that each $x_n$ is a point of continuity of both $F_X$ and $F_Y$. So $F_X(x_n) = F_Y(x_n)$ for all $n$. Taking
the limit as $n \to \infty$ and using the right-continuity of CDFs, we have $F_X(x) = F_Y(x)$. ■



Example 2.1.14 Suppose $X_0$ is a random variable with $P\{X_0 \geq 0\} = 1$. Suppose $X_n = 6 + \sqrt{X_{n-1}}$
for $n \geq 1$. For example, if for some $\omega$ it happens that $X_0(\omega) = 12$, then

$$\begin{aligned} X_1(\omega) &= 6 + \sqrt{12} = 9.464\ldots \\ X_2(\omega) &= 6 + \sqrt{9.464\ldots} = 9.076\ldots \\ X_3(\omega) &= 6 + \sqrt{9.076\ldots} = 9.0127\ldots \end{aligned}$$

Examining Figure 2.10, it is clear that for any $\omega$ with $X_0(\omega) > 0$, the sequence of numbers $X_n(\omega)$
converges to 9. Therefore, $X_n \xrightarrow{a.s.} 9$. The rate of convergence can be bounded as follows. Note that
for each $x \geq 0$, $|6 + \sqrt{x} - 9| \leq |6 + \frac{x}{3} - 9|$. Therefore,

$$|X_n(\omega) - 9| \leq \left|6 + \frac{X_{n-1}(\omega)}{3} - 9\right| = \frac{1}{3}|X_{n-1}(\omega) - 9|$$

so that by induction on $n$,

$$|X_n(\omega) - 9| \leq 3^{-n}|X_0(\omega) - 9| \qquad (2.2)$$

Since $X_n \xrightarrow{a.s.} 9$ it follows that $X_n \xrightarrow{p.} 9$.

Figure 2.10: Graph of the functions $6 + \sqrt{x}$ and $6 + \frac{x}{3}$.

Finally, we investigate m.s. convergence under the assumption that $E[X_0^2] < +\infty$. By the
inequality $(a + b)^2 \leq 2a^2 + 2b^2$, it follows that

$$E[(X_0 - 9)^2] \leq 2(E[X_0^2] + 81) \qquad (2.3)$$

Squaring and taking expectations on each side of (2.2) and using (2.3) thus yields

$$E[|X_n - 9|^2] \leq 2 \cdot 3^{-2n}\{E[X_0^2] + 81\}$$

Therefore, $X_n \xrightarrow{m.s.} 9$.
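The recursion of Example 2.1.14 and the bound (2.2) are easy to check numerically for a single sample point. The sketch below uses the starting value $X_0(\omega) = 12$ from the example; the helper name `iterate` is illustrative.

```python
import math

def iterate(x0, steps):
    """Return [x_0, x_1, ..., x_steps] with x_n = 6 + sqrt(x_{n-1})."""
    xs = [x0]
    for _ in range(steps):
        xs.append(6 + math.sqrt(xs[-1]))
    return xs

xs = iterate(12.0, 10)
errors = [abs(x - 9) for x in xs]
# Right side of (2.2): 3^{-n} |x_0 - 9|.
bounds = [3.0 ** (-n) * abs(xs[0] - 9) for n in range(len(xs))]

for n, (e, b) in enumerate(zip(errors, bounds)):
    print(n, xs[n], e, b)
```

In this run the actual errors shrink faster than $3^{-n}$, which is consistent with (2.2) being an upper bound rather than an exact rate.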



Example 2.1.15 Let $W_0, W_1, \ldots$ be independent, normal random variables with mean 0 and variance 1.
Let $X_{-1} = 0$ and

$$X_n = (0.9)X_{n-1} + W_n, \quad n \geq 0.$$

In what sense does $X_n$ converge as $n$ goes to infinity? For fixed $\omega$, the sequence of numbers
$X_0(\omega), X_1(\omega), \ldots$ might appear as in Figure 2.11.

Intuitively speaking, $X_n$ persistently moves. We claim that $X_n$ does not converge in probability
(so also not in the a.s. or m.s. senses). Here is a proof of the claim. Examination of a table for the
normal distribution yields that $P\{W_n \geq 2\} = P\{W_n \leq -2\} \geq 0.02$. Then

$$\begin{aligned} P\{|X_n - X_{n-1}| \geq 2\} &\geq P\{X_{n-1} \geq 0, W_n \leq -2\} + P\{X_{n-1} < 0, W_n \geq 2\} \\ &= P\{X_{n-1} \geq 0\}P\{W_n \leq -2\} + P\{X_{n-1} < 0\}P\{W_n \geq 2\} \\ &= P\{W_n \geq 2\} \geq 0.02 \end{aligned}$$

Figure 2.11: A typical sample sequence of X.

Therefore, for any random variable $X$,

$$\begin{aligned} P\{|X_n - X| \geq 1\} + P\{|X_{n-1} - X| \geq 1\} &\geq P\{|X_n - X| \geq 1 \text{ or } |X_{n-1} - X| \geq 1\} \\ &\geq P\{|X_n - X_{n-1}| \geq 2\} \geq 0.02 \end{aligned}$$

so $P\{|X_n - X| \geq 1\}$ does not converge to zero as $n \to \infty$. So $X_n$ does not converge in probability
to any random variable $X$. The claim is proved.

Although $X_n$ does not converge in probability (or in the a.s. or m.s. senses), it nevertheless seems
to asymptotically settle into an equilibrium. To probe this point further, let's find the distribution
of $X_n$ for each $n$.

$X_0 = W_0$ is $N(0, 1)$

$X_1 = (0.9)X_0 + W_1$ is $N(0, 1.81)$

$X_2 = (0.9)X_1 + W_2$ is $N(0, (0.81)(1.81) + 1)$

In general, $X_n$ is $N(0, \sigma_n^2)$ where the variances satisfy the recursion $\sigma_n^2 = (0.81)\sigma_{n-1}^2 + 1$ so $\sigma_n^2 \to \sigma_\infty^2$
where $\sigma_\infty^2 = \frac{1}{0.19} = 5.263$. Therefore, the CDF of $X_n$ converges everywhere to the CDF of any
random variable $X$ which has the $N(0, \sigma_\infty^2)$ distribution. So $X_n \xrightarrow{d.} X$ for any such $X$.
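The variance recursion in Example 2.1.15 can be iterated directly to see the approach to $\sigma_\infty^2 = 1/0.19$. This is a minimal sketch; the function name and its parameters `a` and `noise_var` are introduced here for illustration, with defaults matching the example.

```python
def variance_sequence(steps, a=0.9, noise_var=1.0):
    """Variances of X_0, ..., X_steps for X_n = a X_{n-1} + W_n, X_{-1} = 0."""
    variances = []
    v = 0.0                      # variance of X_{-1} = 0
    for _ in range(steps + 1):
        v = a * a * v + noise_var
        variances.append(v)
    return variances

variances = variance_sequence(200)
limit = 1.0 / (1.0 - 0.9 ** 2)   # fixed point of v = 0.81 v + 1

print(variances[:3], variances[-1], limit)
```

The first three values reproduce the variances of $X_0$, $X_1$ and $X_2$ listed above, and the gap to the limit shrinks by the factor 0.81 at each step.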

The previous example involved convergence in distribution of Gaussian random variables. The
limit random variable was also Gaussian. In fact, we close this section by showing that limits of
Gaussian random variables are always Gaussian. Recall that $X$ is a Gaussian random variable with
mean $\mu$ and variance $\sigma^2$ if either $\sigma^2 > 0$ and $F_X(c) = \Phi\left(\frac{c - \mu}{\sigma}\right)$ for all $c$, where $\Phi$ is the CDF of the
standard $N(0, 1)$ distribution, or $\sigma^2 = 0$, in which case $F_X(c) = I_{\{c \geq \mu\}}$ and $P\{X = \mu\} = 1$.



Proposition 2.1.16 Suppose $X_n$ is a Gaussian random variable for each $n$, and that $X_n \to X_\infty$
as $n \to \infty$, in any one of the four senses, a.s., m.s., p., or d. Then $X_\infty$ is also a Gaussian random
variable.

Proof. Since convergence in the other senses implies convergence in distribution, we can assume
that the sequence converges in distribution. Let $\mu_n$ and $\sigma_n^2$ denote the mean and variance of $X_n$.
The first step is to show that the sequence $\sigma_n^2$ is bounded. Intuitively, if it weren't bounded, the
distribution of $X_n$ would get too spread out to converge. Since $F_{X_\infty}$ is a valid CDF, there exists
a value $L$ so large that $F_{X_\infty}(-L) < \frac{1}{3}$ and $F_{X_\infty}(L) > \frac{2}{3}$. By increasing $L$ if necessary, we can also
assume that $L$ and $-L$ are continuity points of $F_{X_\infty}$. So there exists $n_o$ such that, whenever $n \geq n_o$,
$F_{X_n}(-L) \leq \frac{1}{3}$ and $F_{X_n}(L) \geq \frac{2}{3}$. Therefore, for $n \geq n_o$, $P\{|X_n| \leq L\} \geq F_{X_n}(L) - F_{X_n}(-L) \geq \frac{1}{3}$. For
$\sigma_n^2$ fixed, the probability $P\{|X_n| \leq L\}$ is maximized by $\mu_n = 0$, so no matter what the value of $\mu_n$
is, $2\Phi\left(\frac{L}{\sigma_n}\right) - 1 \geq P\{|X_n| \leq L\}$. Therefore, for $n \geq n_o$, $\Phi\left(\frac{L}{\sigma_n}\right) \geq \frac{2}{3}$, or equivalently, $\sigma_n \leq L/\Phi^{-1}\left(\frac{2}{3}\right)$,
where $\Phi^{-1}$ is the inverse of $\Phi$. The first $n_o - 1$ terms of the sequence $(\sigma_n^2)$ are finite. Therefore, the
whole sequence $(\sigma_n^2)$ is bounded.

Constant random variables are considered to be Gaussian random variables, namely degenerate
ones with zero variance. So assume without loss of generality that $X_\infty$ is not a constant random
variable. Then there exists a value $c_o$ so that $F_{X_\infty}(c_o)$ is strictly between zero and one. Since $F_{X_\infty}$
is right-continuous, the function must lie strictly between zero and one over some interval of positive
length, with left endpoint $c_o$. The function can only have countably many points of discontinuity,
so it has infinitely many points of continuity such that the function value is strictly between zero
and one. Let $c_1$ and $c_2$ be two distinct such points, and let $p_1$ and $p_2$ denote the values of $F_{X_\infty}$ at
those two points, and let $b_i = \Phi^{-1}(p_i)$ for $i = 1, 2$. It follows that $\lim_{n \to \infty} \frac{c_i - \mu_n}{\sigma_n} = b_i$ for $i = 1, 2$.
The limit of the difference of the sequences is the difference of the limits, so $\lim_{n \to \infty} \frac{c_1 - c_2}{\sigma_n} = b_1 - b_2$.
Since $c_1 - c_2 \neq 0$ and the sequence $(\sigma_n)$ is bounded, it follows that $(\sigma_n)$ has a finite limit, $\sigma_\infty$, and
therefore also $(\mu_n)$ has a finite limit, $\mu_\infty$. Therefore, the CDFs $F_{X_n}$ converge pointwise to the CDF
for the $N(\mu_\infty, \sigma_\infty^2)$ distribution. Thus, $X_\infty$ has the $N(\mu_\infty, \sigma_\infty^2)$ distribution. ■



2.2 Cauchy criteria for convergence of random variables 

It is important to be able to show that a limit exists even if the limit value is not known. For
example, it is useful to determine if the sum of an infinite series of numbers is convergent without
needing to know the value of the sum. One useful result for this purpose is that if $(x_n : n \geq 1)$
is monotone nondecreasing, i.e. $x_1 \leq x_2 \leq \cdots$, and if it satisfies $x_n \leq L$ for all $n$ for some
finite constant $L$, then the sequence is convergent. This result carries over immediately to random
variables: if $(X_n : n \geq 1)$ is a sequence of random variables such that $P\{X_n \leq X_{n+1}\} = 1$ for all $n$ and
if there is a random variable $Y$ such that $P\{X_n \leq Y\} = 1$ for all $n$, then $(X_n)$ converges a.s.

For deterministic sequences that are not monotone, the Cauchy criterion gives a simple yet general
condition that implies convergence to a finite limit. A deterministic sequence $(x_n : n \geq 1)$ is said
to be a Cauchy sequence if $\lim_{m,n \to \infty} |x_m - x_n| = 0$. This means that, for any $\epsilon > 0$, there exists $N$
sufficiently large, such that $|x_m - x_n| < \epsilon$ for all $m, n \geq N$. If the sequence $(x_n)$ has a finite limit
$x_\infty$, then the triangle inequality for distances between numbers, $|x_m - x_n| \leq |x_m - x_\infty| + |x_n - x_\infty|$,
implies that the sequence is a Cauchy sequence. More useful is the converse statement, called the
Cauchy criterion for convergence, or the completeness property of $\mathbb{R}$: If $(x_n)$ is a Cauchy sequence
then $(x_n)$ converges to a finite limit as $n \to \infty$. The following proposition gives similar criteria for
convergence of random variables.

Proposition 2.2.1 (Cauchy criteria for random variables) Let $(X_n)$ be a sequence of random
variables on a probability space $(\Omega, \mathcal{F}, P)$.

(a) $X_n$ converges a.s. to some random variable if and only if

$$P\{\omega : \lim_{m,n \to \infty} |X_m(\omega) - X_n(\omega)| = 0\} = 1.$$

(b) $X_n$ converges m.s. to some random variable if and only if $(X_n)$ is a Cauchy sequence in the
m.s. sense, meaning $E[X_n^2] < +\infty$ for all $n$ and

$$\lim_{m,n \to \infty} E[(X_m - X_n)^2] = 0. \qquad (2.4)$$

(c) $X_n$ converges p. to some random variable if and only if for every $\epsilon > 0$,

$$\lim_{m,n \to \infty} P\{|X_m - X_n| \geq \epsilon\} = 0. \qquad (2.5)$$



Proof. (a) For any $\omega$ fixed, $(X_n(\omega) : n \geq 1)$ is a sequence of numbers. So by the Cauchy criterion
for convergence of a sequence of numbers, the following equality of sets holds:

$$\{\omega : \lim_{n \to \infty} X_n(\omega) \text{ exists and is finite}\} = \{\omega : \lim_{m,n \to \infty} |X_m(\omega) - X_n(\omega)| = 0\}.$$

Thus, the set on the left has probability one (i.e. $(X_n)$ converges a.s. to a random variable) if and
only if the set on the right has probability one. Part (a) is proved.

(b) First the "only if" part is proved. Suppose $X_n \xrightarrow{m.s.} X_\infty$. By the $L^2$ triangle inequality for
random variables,

$$E[(X_n - X_m)^2]^{\frac{1}{2}} \leq E[(X_m - X_\infty)^2]^{\frac{1}{2}} + E[(X_n - X_\infty)^2]^{\frac{1}{2}} \qquad (2.6)$$

Since $X_n \xrightarrow{m.s.} X_\infty$, the right side of (2.6) converges to zero as $m, n \to \infty$, so that (2.4) holds. The
"only if" part of (b) is proved.

Moving to the proof of the "if" part, suppose (2.4) holds. Choose the sequence $k_1 < k_2 < \cdots$
recursively as follows. Let $k_1$ be so large that $E[(X_n - X_{k_1})^2] \leq 1/2$ for all $n \geq k_1$. Once $k_1, \ldots, k_{i-1}$
are selected, let $k_i$ be so large that $k_i > k_{i-1}$ and $E[(X_n - X_{k_i})^2] \leq 2^{-i}$ for all $n \geq k_i$. It follows from
this choice of the $k_i$'s that $E[(X_{k_{i+1}} - X_{k_i})^2] \leq 2^{-i}$ for all $i \geq 1$. Let $S_n = |X_{k_1}| + \sum_{i=1}^{n-1} |X_{k_{i+1}} - X_{k_i}|$.
Note that $|X_{k_i}| \leq S_n$ for $1 \leq i \leq n$ by the triangle inequality for differences of real numbers. By
the $L^2$ triangle inequality for random variables (1.18), and since $\sum_{i=1}^{\infty} 2^{-i/2} = 1 + \sqrt{2} < 3$,

$$E[S_n^2]^{\frac{1}{2}} \leq E[X_{k_1}^2]^{\frac{1}{2}} + \sum_{i=1}^{n-1} E[(X_{k_{i+1}} - X_{k_i})^2]^{\frac{1}{2}} \leq E[X_{k_1}^2]^{\frac{1}{2}} + 3.$$

Since $S_n$ is monotonically increasing, it converges a.s. to a limit $S_\infty$. Note that $|X_{k_i}| \leq S_\infty$ for
all $i \geq 1$. By the monotone convergence theorem, $E[S_\infty^2] = \lim_{n \to \infty} E[S_n^2] \leq (E[X_{k_1}^2]^{\frac{1}{2}} + 3)^2$. So,
$S_\infty$ is in $L^2(\Omega, \mathcal{F}, P)$. In particular, $S_\infty$ is finite a.s., and for any $\omega$ such that $S_\infty(\omega)$ is finite, the
sequence of numbers $(X_{k_i}(\omega) : i \geq 1)$ is a Cauchy sequence. (See Example 11.2.3 in the appendix.)
By completeness of $\mathbb{R}$, for $\omega$ in that set, the limit $X_\infty(\omega)$ exists. Let $X_\infty(\omega) = 0$ on the zero
probability event that $(X_{k_i}(\omega) : i \geq 1)$ does not converge. Summarizing, we have $\lim_{i \to \infty} X_{k_i} = X_\infty$
a.s. and $|X_{k_i}| \leq S_\infty$ where $S_\infty \in L^2(\Omega, \mathcal{F}, P)$. It therefore follows from Proposition 2.1.13(c) that
$X_{k_i} \xrightarrow{m.s.} X_\infty$.

The final step is to prove that the entire sequence $(X_n)$ converges in the m.s. sense to $X_\infty$.
For this purpose, let $\epsilon > 0$. Select $i$ so large that $E[(X_n - X_{k_i})^2] < \epsilon^2$ for all $n \geq k_i$, and
$E[(X_{k_i} - X_\infty)^2] \leq \epsilon^2$. Then, by the $L^2$ triangle inequality, for any $n \geq k_i$,

$$E[(X_n - X_\infty)^2]^{\frac{1}{2}} \leq E[(X_n - X_{k_i})^2]^{\frac{1}{2}} + E[(X_{k_i} - X_\infty)^2]^{\frac{1}{2}} \leq 2\epsilon$$

Since $\epsilon$ was arbitrary, $X_n \xrightarrow{m.s.} X_\infty$. The proof of (b) is complete.

(c) First the "only if" part is proved. Suppose $X_n \xrightarrow{p.} X_\infty$. Then for any $\epsilon > 0$,

$$P\{|X_m - X_n| \geq 2\epsilon\} \leq P\{|X_m - X_\infty| \geq \epsilon\} + P\{|X_n - X_\infty| \geq \epsilon\} \to 0$$

as $m, n \to \infty$, so that (2.5) holds. The "only if" part is proved.

Moving to the proof of the "if" part, suppose (2.5) holds. Select an increasing sequence of
integers $k_i$ so that $P\{|X_n - X_m| \geq 2^{-i}\} \leq 2^{-i}$ for all $m, n \geq k_i$. It follows, in particular, that
$P\{|X_{k_{i+1}} - X_{k_i}| \geq 2^{-i}\} \leq 2^{-i}$. Since the sum of the probabilities of these events is finite, the
probability that infinitely many of the events are true is zero, by the Borel-Cantelli lemma (specifically,
Lemma 1.2.2(a)). Thus, $P\{|X_{k_{i+1}} - X_{k_i}| \leq 2^{-i} \text{ for all large enough } i\} = 1$. Thus, for all $\omega$ in a
set with probability one, $(X_{k_i}(\omega) : i \geq 1)$ is a Cauchy sequence of numbers. By completeness of
$\mathbb{R}$, for $\omega$ in that set, the limit $X_\infty(\omega)$ exists. Let $X_\infty(\omega) = 0$ on the zero probability event that
$(X_{k_i}(\omega) : i \geq 1)$ does not converge. Then, $X_{k_i} \xrightarrow{a.s.} X_\infty$. It follows that $X_{k_i} \xrightarrow{p.} X_\infty$ as well.

The final step is to prove that the entire sequence $(X_n)$ converges in the p. sense to $X_\infty$. For
this purpose, let $\epsilon > 0$. Select $i$ so large that $P\{|X_n - X_{k_i}| \geq \epsilon\} < \epsilon$ for all $n \geq k_i$, and
$P\{|X_{k_i} - X_\infty| \geq \epsilon\} < \epsilon$. Then $P\{|X_n - X_\infty| \geq 2\epsilon\} < 2\epsilon$ for all $n \geq k_i$. Since $\epsilon$ was arbitrary,
$X_n \xrightarrow{p.} X_\infty$. The proof of (c) is complete. ■

The following is a corollary of Proposition 2.2.1(c) and its proof. 

Corollary 2.2.2 If $X_n \xrightarrow{p.} X_\infty$, then there is a subsequence $(X_{k_i} : i \geq 1)$ such that $\lim_{i \to \infty} X_{k_i} = X_\infty$ a.s.

Proof. By Proposition 2.2.1(c), the sequence satisfies (2.5). By the proof of Proposition 2.2.1(c)
there is a subsequence $(X_{k_i})$ that converges a.s. By uniqueness of limits in the p. or a.s. senses, the
limit of the subsequence is the same random variable, $X_\infty$ (up to differences on a set of measure
zero). ■

Proposition 2.2.1(b), the Cauchy criteria for mean square convergence, is used extensively in 
these notes. The remainder of this section concerns a more convenient form of the Cauchy criteria 
for m.s. convergence. 

Proposition 2.2.3 (Correlation version of the Cauchy criterion for m.s. convergence) Let $(X_n)$
be a sequence of random variables with $E[X_n^2] < +\infty$ for each $n$. Then there exists a random
variable $X$ such that $X_n \xrightarrow{m.s.} X$ if and only if the limit $\lim_{m,n \to \infty} E[X_n X_m]$ exists and is finite.
Furthermore, if $X_n \xrightarrow{m.s.} X$, then $\lim_{m,n \to \infty} E[X_n X_m] = E[X^2]$.

Proof. The "if" part is proved first. Suppose $\lim_{m,n \to \infty} E[X_n X_m] = c$ for a finite constant $c$. Then

$$\begin{aligned} E[(X_n - X_m)^2] &= E[X_n^2] - 2E[X_n X_m] + E[X_m^2] \\ &\to c - 2c + c = 0 \text{ as } m, n \to \infty \end{aligned}$$

Thus, $X_n$ is Cauchy in the m.s. sense, so $X_n \xrightarrow{m.s.} X$ for some random variable $X$.
To prove the "only if" part, suppose $X_n \xrightarrow{m.s.} X$. Observe next that

$$\begin{aligned} E[X_m X_n] &= E[(X + (X_m - X))(X + (X_n - X))] \\ &= E[X^2 + (X_m - X)X + X(X_n - X) + (X_m - X)(X_n - X)] \end{aligned}$$

By the Cauchy-Schwarz inequality,

$$\begin{aligned} E[|(X_m - X)X|] &\leq E[(X_m - X)^2]^{\frac{1}{2}} E[X^2]^{\frac{1}{2}} \to 0 \\ E[|(X_m - X)(X_n - X)|] &\leq E[(X_m - X)^2]^{\frac{1}{2}} E[(X_n - X)^2]^{\frac{1}{2}} \to 0 \end{aligned}$$

and similarly $E[|X(X_n - X)|] \to 0$. Thus $E[X_m X_n] \to E[X^2]$. This establishes both the "only if"
part of the proposition and the last statement of the proposition. The proof of the proposition is
complete. □

Corollary 2.2.4 Suppose $X_n \xrightarrow{m.s.} X$ and $Y_n \xrightarrow{m.s.} Y$. Then $E[X_n Y_n] \to E[XY]$.

Proof. By the inequality $(a + b)^2 \leq 2a^2 + 2b^2$, it follows that $X_n + Y_n \xrightarrow{m.s.} X + Y$ as $n \to \infty$.
Proposition 2.2.3 therefore implies that $E[(X_n + Y_n)^2] \to E[(X + Y)^2]$, $E[X_n^2] \to E[X^2]$, and
$E[Y_n^2] \to E[Y^2]$. Since $X_n Y_n = ((X_n + Y_n)^2 - X_n^2 - Y_n^2)/2$, the corollary follows. ■

Corollary 2.2.5 Suppose $X_n \xrightarrow{m.s.} X$. Then $E[X_n] \to E[X]$.

Proof. Corollary 2.2.5 follows from Corollary 2.2.4 by taking $Y_n = 1$ for all $n$. ■

Example 2.2.6 This example illustrates the use of Proposition 2.2.3. Let $X_1, X_2, \ldots$ be mean
zero random variables such that

$$E[X_i X_j] = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{else.} \end{cases}$$

Does the series $\sum_{k=1}^{\infty} \frac{X_k}{k}$ converge in the mean square sense to a random variable with a finite second
moment? Let $Y_n = \sum_{k=1}^{n} \frac{X_k}{k}$. The question is whether $Y_n$ converges in the mean square sense to
a random variable with finite second moment. The answer is yes if and only if $\lim_{m,n \to \infty} E[Y_m Y_n]$
exists and is finite. Observe that

$$E[Y_m Y_n] = \sum_{k=1}^{\min(m,n)} \frac{1}{k^2} \to \sum_{k=1}^{\infty} \frac{1}{k^2} \text{ as } m, n \to \infty$$

This sum is smaller than $1 + \int_1^{\infty} \frac{1}{x^2}\,dx = 2 < \infty$.$^1$ Therefore, by Proposition 2.2.3, the series
$\sum_{k=1}^{\infty} \frac{X_k}{k}$ indeed converges in the m.s. sense.
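The correlation criterion in Example 2.2.6 can be evaluated numerically. In the sketch below, `correlation(m, n)` computes $E[Y_m Y_n]$ from the orthonormality of the $X_k$, and `ms_distance_squared` expands $E[(Y_m - Y_n)^2]$ as in the proof of Proposition 2.2.3; both names are illustrative.

```python
import math

def correlation(m, n):
    """E[Y_m Y_n] = sum_{k=1}^{min(m,n)} 1/k^2 for orthonormal X_k."""
    return sum(1.0 / k**2 for k in range(1, min(m, n) + 1))

def ms_distance_squared(m, n):
    """E[(Y_m - Y_n)^2] = E[Y_m^2] - 2 E[Y_m Y_n] + E[Y_n^2]."""
    return correlation(m, m) - 2 * correlation(m, n) + correlation(n, n)

limit = math.pi**2 / 6
corr_values = [correlation(m, 2 * m) for m in (10, 100, 1000, 10000)]
gaps = [ms_distance_squared(m, 2 * m) for m in (10, 100, 1000)]
print(corr_values, limit, gaps)
```

The correlations stay below the integral bound 2 and approach a finite limit, while $E[(Y_m - Y_{2m})^2]$ shrinks toward zero, as the Cauchy criterion requires.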

2.3 Limit theorems for sums of independent random variables 

Sums of many independent random variables often have distributions that can be characterized 
by a small number of parameters. For engineering applications, this represents a low complexity 
method for describing the random variables. An analogous tool is the Taylor series approximation. 
A continuously differentiable function $f$ can be approximated near zero by the first order Taylor's
approximation

$$f(x) \approx f(0) + x f'(0)$$

A second order approximation, in case $f$ is twice continuously differentiable, is

$$f(x) \approx f(0) + x f'(0) + \frac{x^2}{2} f''(0)$$

Bounds on the approximation error are given by Taylor's theorem, found in Appendix 11.4. In
essence, Taylor's approximation lets us represent the function by the numbers $f(0)$, $f'(0)$ and
$f''(0)$. We shall see that the law of large numbers and central limit theorem can be viewed not just
as analogies of the first and second order Taylor's approximations, but actually as consequences of
them.



Lemma 2.3.1 If $x_n \to x$ as $n \to \infty$ then $\left(1 + \frac{x_n}{n}\right)^n \to e^x$ as $n \to \infty$.

Proof. The basic idea is to note that $(1 + s)^n = \exp(n \ln(1 + s))$, and apply Taylor's theorem to
$\ln(1 + s)$ about the point $s = 0$. The details are given next. Since $\ln(1 + s)|_{s=0} = 0$, $\ln(1 + s)'|_{s=0} = 1$,
and $\ln(1 + s)'' = -\frac{1}{(1 + s)^2}$, the mean value form of Taylor's Theorem (see the appendix) yields that
if $s > -1$, then $\ln(1 + s) = s - \frac{s^2}{2(1 + y)^2}$, where $y$ lies in the closed interval with endpoints 0 and $s$.
Thus, if $s \geq 0$, then $y \geq 0$, so that

$$s - \frac{s^2}{2} \leq \ln(1 + s) \leq s \quad \text{if } s \geq 0.$$

More to the point, if it is only known that $s \geq -\frac{1}{2}$, then $y \geq -\frac{1}{2}$, so that

$$s - 2s^2 \leq \ln(1 + s) \leq s \quad \text{if } s \geq -\frac{1}{2}.$$

Letting $s = \frac{x_n}{n}$, multiplying through by $n$, and applying the exponential function, yields that

$$\exp\left(x_n - \frac{2x_n^2}{n}\right) \leq \left(1 + \frac{x_n}{n}\right)^n \leq \exp(x_n) \quad \text{if } x_n \geq -\frac{n}{2}.$$

If $x_n \to x$ as $n \to \infty$ then the condition $x_n > -\frac{n}{2}$ holds for all large enough $n$, and $\frac{x_n^2}{n} \to 0$,
yielding the desired result. ■

$^1$ In fact, the sum in Example 2.2.6 is equal to $\frac{\pi^2}{6}$, but the technique of comparing the sum to an integral to show the sum is finite
is the main point here.
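The two-sided bound used in the proof of Lemma 2.3.1 can be checked numerically. The sketch below assumes a hypothetical sequence $x_n = x + 1/n$ with $x = -1.5$ (any real sequence with limit $x$ would do) and compares $(1 + x_n/n)^n$ with the lower and upper bounds from the proof.

```python
import math

def power_term(x_n, n):
    """(1 + x_n/n)^n."""
    return (1.0 + x_n / n) ** n

def lemma_bounds(x_n, n):
    """Lower and upper bounds from the proof, valid when x_n >= -n/2."""
    lower = math.exp(x_n - 2.0 * x_n**2 / n)
    upper = math.exp(x_n)
    return lower, upper

x = -1.5
checks = []
for n in (10, 100, 1000, 10000):
    x_n = x + 1.0 / n                    # assumed sequence with limit x
    lower, upper = lemma_bounds(x_n, n)
    checks.append((n, lower, power_term(x_n, n), upper))

for row in checks:
    print(row)
print(math.exp(x))
```

Both bounds tend to $e^x$, which squeezes the middle term to the same limit.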



A sequence of random variables $(X_n)$ is said to be independent and identically distributed (iid)
if the $X_i$'s are mutually independent and identically distributed.

Proposition 2.3.2 (Law of large numbers) Suppose that $X_1, X_2, \ldots$ is a sequence of random variables
such that each $X_i$ has finite mean $m$. Let $S_n = X_1 + \cdots + X_n$. Then

(a) $\frac{S_n}{n} \xrightarrow{m.s.} m$ (hence also $\frac{S_n}{n} \xrightarrow{p.} m$ and $\frac{S_n}{n} \xrightarrow{d.} m$) if for some constant $c$, $\text{Var}(X_i) \leq c$ for all $i$,
and $\text{Cov}(X_i, X_j) = 0$ for $i \neq j$ (i.e. if the variances are bounded and the $X_i$'s are uncorrelated).

(b) $\frac{S_n}{n} \xrightarrow{p.} m$ if $X_1, X_2, \ldots$ are iid. (This version is the weak law of large numbers.)

(c) $\frac{S_n}{n} \xrightarrow{a.s.} m$ if $X_1, X_2, \ldots$ are iid. (This version is the strong law of large numbers.)

We give a proof of (a) and (b), but prove (c) only under an extra condition. Suppose the conditions
of (a) are true. Then

$$E\left[\left(\frac{S_n}{n} - m\right)^2\right] = \text{Var}\left(\frac{S_n}{n}\right) = \frac{1}{n^2}\text{Var}(S_n) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\text{Cov}(X_i, X_j) = \frac{1}{n^2}\sum_{i=1}^{n}\text{Var}(X_i) \leq \frac{c}{n}$$

Therefore $\frac{S_n}{n} \xrightarrow{m.s.} m$.

Turn next to part (b). If in addition to the conditions of (b) it is assumed that $\text{Var}(X_1) < +\infty$,
then the conditions of part (a) are true. Since mean square convergence implies convergence in
probability, the conclusion of part (b) follows. An extra credit problem shows how to use the same
approach to verify (b) even if $\text{Var}(X_1) = +\infty$.
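The computation in part (a) can be verified exactly in a simple special case. The sketch below assumes iid Bernoulli($p$) variables with the hypothetical value $p = 1/3$, so that $m = p$ and $\text{Var}(X_i) = p(1 - p)$, and evaluates $E[(S_n/n - m)^2]$ from the binomial distribution of $S_n$ using exact rational arithmetic.

```python
from fractions import Fraction
from math import comb

def mean_square_error(n, p):
    """Exact E[(S_n/n - p)^2] when S_n is a sum of n iid Bernoulli(p) variables."""
    total = Fraction(0)
    for k in range(n + 1):
        pmf = comb(n, k) * p**k * (1 - p) ** (n - k)   # P{S_n = k}
        total += pmf * (Fraction(k, n) - p) ** 2
    return total

p = Fraction(1, 3)               # assumed success probability
c = p * (1 - p)                  # Var(X_i), which serves as the bound c
errors = {n: mean_square_error(n, p) for n in (1, 10, 100)}
print(errors)
```

In this iid case the bound holds with equality, $E[(S_n/n - m)^2] = c/n$, so the mean square error decays exactly like $1/n$.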

Here a second approach to proving (b) is given. The characteristic function of $\frac{X_i}{n}$ is given by

$$E\left[\exp\left(\frac{juX_i}{n}\right)\right] = E\left[\exp\left(j\left(\frac{u}{n}\right)X_i\right)\right] = \Phi_X\left(\frac{u}{n}\right)$$

where $\Phi_X$ denotes the characteristic function of $X_1$. Since the characteristic function of the sum
of independent random variables is the product of the characteristic functions,

$$\Phi_{\frac{S_n}{n}}(u) = \left(\Phi_X\left(\frac{u}{n}\right)\right)^n$$

Since $E(X_1) = m$ it follows that $\Phi_X$ is differentiable with $\Phi_X(0) = 1$, $\Phi'_X(0) = jm$ and $\Phi'_X$ is
continuous. By Taylor's theorem, for any $u$ fixed,

$$\Phi_X\left(\frac{u}{n}\right) = 1 + \frac{u\Phi'_X(u_n)}{n}$$

for some $u_n$ between 0 and $\frac{u}{n}$ for all $n$. Since $\Phi'_X(u_n) \to jm$ as $n \to \infty$, Lemma 2.3.1 yields
$\Phi_X\left(\frac{u}{n}\right)^n \to \exp(jum)$ as $n \to \infty$. Note that $\exp(jum)$ is the characteristic function of a random
variable equal to $m$ with probability one. Since pointwise convergence of characteristic functions to
a valid characteristic function implies convergence in distribution, it follows that $\frac{S_n}{n} \xrightarrow{d.} m$. However,
convergence in distribution to a constant implies convergence in probability, so (b) is proved.

Part (c) is proved under the additional assumption that $E[X_1^4] < +\infty$. Without loss of generality
we assume that $E[X_1] = 0$. Consider expanding $S_n^4$. There are $n$ terms of the form $X_i^4$ and $3n(n - 1)$
terms of the form $X_i^2 X_j^2$ with $1 \leq i, j \leq n$ and $i \neq j$. The other terms have the form $X_i^3 X_j$, $X_i^2 X_j X_k$
or $X_i X_j X_k X_l$ for distinct $i, j, k, l$, and these terms have mean zero. Thus,

$$E[S_n^4] = nE[X_1^4] + 3n(n - 1)E[X_1^2]^2$$

Let $Y = \sum_{n=1}^{\infty} \left(\frac{S_n}{n}\right)^4$. The value of $Y$ is well defined but it is a priori possible that $Y(\omega) = +\infty$ for
some $\omega$. However, by the monotone convergence theorem, the expectation of the sum of nonnegative
random variables is the sum of the expectations, so that

$$E[Y] = \sum_{n=1}^{\infty} E\left[\left(\frac{S_n}{n}\right)^4\right] = \sum_{n=1}^{\infty} \frac{nE[X_1^4] + 3n(n - 1)E[X_1^2]^2}{n^4} < +\infty$$

Therefore, $P\{Y < +\infty\} = 1$. However, $\{Y < +\infty\}$ is a subset of the event of convergence
$\{\omega : \frac{S_n(\omega)}{n} \to 0 \text{ as } n \to \infty\}$, so the event of convergence also has probability one. Thus, part (c)
under the extra fourth moment condition is proved.
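The fourth moment identity $E[S_n^4] = nE[X_1^4] + 3n(n-1)E[X_1^2]^2$ can be confirmed by brute-force enumeration for small $n$. The sketch below assumes a hypothetical mean-zero distribution on $\{-1, 0, 2\}$; the distribution and the function names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Assumed mean-zero distribution: values and probabilities of X_1.
VALUES = (-1, 0, 2)
PROBS = (Fraction(1, 2), Fraction(1, 4), Fraction(1, 4))

def moment(k):
    """E[X_1^k] for the distribution above."""
    return sum(p * v**k for v, p in zip(VALUES, PROBS))

def fourth_moment_of_sum(n):
    """E[S_n^4] by enumerating all outcomes of (X_1, ..., X_n)."""
    total = Fraction(0)
    for outcome in product(range(len(VALUES)), repeat=n):
        prob = Fraction(1)
        for idx in outcome:
            prob *= PROBS[idx]
        total += prob * sum(VALUES[idx] for idx in outcome) ** 4
    return total

def formula(n):
    """n E[X_1^4] + 3 n (n - 1) E[X_1^2]^2."""
    return n * moment(4) + 3 * n * (n - 1) * moment(2) ** 2

for n in range(1, 6):
    print(n, fourth_moment_of_sum(n), formula(n))
```

Because $E[S_n^4]$ grows only like $n^2$, the terms $E[(S_n/n)^4]$ decay like $1/n^2$, which is what makes the series for $E[Y]$ finite.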

Proposition 2.3.3 (Central Limit Theorem) Suppose that $X_1, X_2, \ldots$ are i.i.d., each with mean
$\mu$ and variance $\sigma^2$. Let $S_n = X_1 + \cdots + X_n$. Then the normalized sum

$$\frac{S_n - n\mu}{\sqrt{n}}$$

converges in distribution to the $N(0, \sigma^2)$ distribution as $n \to \infty$.

Proof. Without loss of generality, assume that $\mu = 0$. Then the characteristic function of
the normalized sum $\frac{S_n}{\sqrt{n}}$ is given by $\Phi_X\left(\frac{u}{\sqrt{n}}\right)^n$, where $\Phi_X$ denotes the characteristic function of $X_1$.
Since $X_1$ has mean 0 and finite second moment $\sigma^2$, it follows that $\Phi_X$ is twice differentiable with
$\Phi_X(0) = 1$, $\Phi'_X(0) = 0$, $\Phi''_X(0) = -\sigma^2$, and $\Phi''_X$ is continuous. By Taylor's theorem, for any $u$ fixed,

$$\Phi_X\left(\frac{u}{\sqrt{n}}\right) = 1 + \frac{u^2}{2n}\Phi''_X(u_n)$$

for some $u_n$ between 0 and $\frac{u}{\sqrt{n}}$ for all $n$. Since $\Phi''_X(u_n) \to -\sigma^2$ as $n \to \infty$, Lemma 2.3.1 yields
$\Phi_X\left(\frac{u}{\sqrt{n}}\right)^n \to \exp\left(-\frac{u^2\sigma^2}{2}\right)$ as $n \to \infty$. Since pointwise convergence of characteristic functions to a
valid characteristic function implies convergence in distribution, the proposition is proved. ■
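The convergence of characteristic functions in the proof can be observed in a case where $\Phi_X$ is known in closed form. The sketch below assumes the $X_i$ take the values $+1$ and $-1$ with probability $1/2$ each (a choice made here for illustration), so that $\mu = 0$, $\sigma^2 = 1$ and $\Phi_X(u) = \cos(u)$.

```python
import math

def char_fn_normalized_sum(u, n):
    """Characteristic function of S_n / sqrt(n) for iid X_i = +1 or -1 with probability 1/2.

    Here Phi_X(u) = cos(u), so the normalized sum has characteristic function cos(u/sqrt(n))^n.
    """
    return math.cos(u / math.sqrt(n)) ** n

def char_fn_gaussian(u, variance=1.0):
    """Characteristic function of the N(0, variance) distribution."""
    return math.exp(-0.5 * variance * u * u)

u = 1.5
gaps = [abs(char_fn_normalized_sum(u, n) - char_fn_gaussian(u))
        for n in (10, 100, 1000, 10000)]
print(gaps)
```

The gap to $\exp(-u^2/2)$ shrinks as $n$ grows, in line with the pointwise convergence used in the proof.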
2.4 Convex functions and Jensen's inequality

Let $\varphi$ be a function on $\mathbb{R}$ with values in $\mathbb{R} \cup \{+\infty\}$ such that $\varphi(x) < \infty$ for at least one value of $x$.
Then $\varphi$ is said to be convex if for any $a$, $b$ and $\lambda$ with $a < b$ and $0 \leq \lambda \leq 1$

$$\varphi(a\lambda + b(1 - \lambda)) \leq \lambda\varphi(a) + (1 - \lambda)\varphi(b).$$

This means that the graph of $\varphi$ on any interval $[a, b]$ lies below the line segment equal to $\varphi$ at the
endpoints of the interval.

Proposition 2.4.1 Suppose $f$ is twice continuously differentiable on $\mathbb{R}$. Then $f$ is convex if and
only if $f''(v) \geq 0$ for all $v$.

Proof. Suppose that $f$ is twice continuously differentiable. Given $a < b$, define $D_{ab} = \lambda f(a) +
(1 - \lambda)f(b) - f(\lambda a + (1 - \lambda)b)$. For any $x \geq a$,

$$\begin{aligned} f(x) &= f(a) + \int_a^x f'(u)\,du = f(a) + (x - a)f'(a) + \int_a^x \int_a^u f''(v)\,dv\,du \\ &= f(a) + (x - a)f'(a) + \int_a^x (x - v)f''(v)\,dv \end{aligned}$$

Applying this to $D_{ab}$ yields

$$D_{ab} = \int_a^b \min\{\lambda(v - a), (1 - \lambda)(b - v)\}f''(v)\,dv \qquad (2.7)$$

On one hand, if $f''(v) \geq 0$ for all $v$, then $D_{ab} \geq 0$ whenever $a < b$, so that $f$ is convex. On the
other hand, if $f''(v_o) < 0$ for some $v_o$, then by the assumed continuity of $f''$, there is a small open
interval $(a, b)$ containing $v_o$ so that $f''(v) < 0$ for $a < v < b$. But then $D_{ab} < 0$, implying that $f$ is
not convex. ■

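Examples of convex functions include:

$ax^2 + bx + c$ for constants $a, b, c$ with $a \geq 0$,

$e^{\lambda x}$ for $\lambda$ constant,

$$\varphi(x) = \begin{cases} -\ln x & x > 0 \\ +\infty & x \leq 0, \end{cases}$$

$$\varphi(x) = \begin{cases} x \ln x & x > 0 \\ 0 & x = 0 \\ +\infty & x < 0. \end{cases}$$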

Theorem 2.4.2 (Jensen's inequality) Let $\varphi$ be a convex function and let $X$ be a random variable
such that $E[X]$ is finite. Then $E[\varphi(X)] \geq \varphi(E[X])$.

For example, Jensen's inequality implies that $E[X^2] \geq E[X]^2$, which also follows from the fact
$\text{Var}(X) = E[X^2] - E[X]^2$.

Proof. Since $\varphi$ is convex, there is a tangent to the graph of $\varphi$ at $E[X]$, meaning there is a
function $L$ of the form $L(x) = a + bx$ such that $\varphi(x) \geq L(x)$ for all $x$ and $\varphi(E[X]) = L(E[X])$.
See the illustration in Figure 2.12. Therefore $E[\varphi(X)] \geq E[L(X)] = L(E[X]) = \varphi(E[X])$, which
establishes the theorem. ■




Figure 2.12: A convex function and a tangent linear function.



A function $\varphi$ is called concave if $-\varphi$ is convex. If $\varphi$ is concave then $E[\varphi(X)] \le \varphi(E[X])$.
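
As a small illustration of Jensen's inequality and its concave counterpart, the sketch below evaluates both sides for a hypothetical three-point distribution (the values and probabilities are made up for the example) and for the convex functions $x^2$ and $-\ln x$ and the concave function $\sqrt{x}$.

```python
import math

# Hypothetical discrete distribution: P{X = values[i]} = probs[i].
values = [0.5, 1.0, 4.0]
probs = [0.2, 0.5, 0.3]

def expect(g):
    """E[g(X)] for the discrete distribution above."""
    return sum(p * g(x) for x, p in zip(values, probs))

mean = expect(lambda x: x)

# Convex functions: E[phi(X)] >= phi(E[X]).
print(expect(lambda x: x * x), mean ** 2)
print(expect(lambda x: -math.log(x)), -math.log(mean))
# Concave function: the inequality reverses.
print(expect(math.sqrt), math.sqrt(mean))
```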



2.5 Chernoff bound and large deviations theory 

Let $X_1, X_2, \ldots$ be an iid sequence of random variables with finite mean $\mu$, and let $S_n = X_1 + \cdots + X_n$.
The weak law of large numbers implies that for fixed $a$ with $a > \mu$, $P\{\frac{S_n}{n} \ge a\} \to 0$ as $n \to \infty$. In
case the $X_i$'s have finite variance, the central limit theorem offers a refinement of the law of large
numbers, by identifying the limit of $P\{\frac{S_n}{n} \ge a_n\}$, where $(a_n)$ is a sequence that converges to $\mu$ in
the particular manner: $a_n = \mu + \frac{c}{\sqrt{n}}$. For fixed $c$, the limit is not zero. One can think of the central
limit theorem, therefore, as concerning "normal" deviations of $S_n$ from its mean. Large deviations
theory, by contrast, addresses $P\{\frac{S_n}{n} \ge a\}$ for $a$ fixed, and in particular it identifies how quickly
$P\{\frac{S_n}{n} \ge a\}$ converges to zero as $n \to \infty$. We shall first describe the Chernoff bound, which is a
simple upper bound on $P\{\frac{S_n}{n} \ge a\}$. Then Cramér's theorem, to the effect that the Chernoff bound
is in a certain sense tight, is stated.

The moment generating function of $X_1$ is defined by $M(\theta) = E[e^{\theta X_1}]$, and $\ln M(\theta)$ is called the
log moment generating function. Since $e^{\theta X_1}$ is a positive random variable, the expectation, and
hence $M(\theta)$ itself, is well-defined for all real values of $\theta$, with possible value $+\infty$. The Chernoff
bound is simply given as

$$P\left\{\frac{S_n}{n} \ge a\right\} \le \exp(-n[\theta a - \ln M(\theta)]) \quad \text{for } \theta \ge 0 \qquad (2.8)$$



The bound (2.8), like the Chebychev inequality, is a consequence of Markov's inequality applied to
an appropriate function. For $\theta > 0$ (the case $\theta = 0$ is trivial, since the right side of (2.8) is then 1):

$$\begin{aligned}
P\left\{\frac{S_n}{n} \ge a\right\} &= P\{e^{\theta(X_1 + \cdots + X_n - na)} \ge 1\} \\
&\le E[e^{\theta(X_1 + \cdots + X_n - na)}] \\
&= E[e^{\theta X_1}]^n e^{-n\theta a} = \exp(-n[\theta a - \ln M(\theta)])
\end{aligned}$$

To make the best use of the Chernoff bound we can optimize the bound by selecting the best $\theta$.
Thus, we wish to select $\theta \ge 0$ to maximize $a\theta - \ln M(\theta)$.

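In general the log moment generating function $\ln M$ is convex. Note that $\ln M(0) = 0$. Let us
suppose that $M(\theta)$ is finite for some $\theta > 0$. Then

$$\left.\frac{d \ln M(\theta)}{d\theta}\right|_{\theta = 0} = \left.\frac{E[X_1 e^{\theta X_1}]}{E[e^{\theta X_1}]}\right|_{\theta = 0} = E[X_1]$$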



The sketch of a typical case is shown in Figure 2.13. Figure 2.13 also shows the line of slope a. 



Figure 2.13: A log moment generating function and a line of slope a.

Because of the assumption that $a > E[X_1]$, the line lies strictly above $\ln M(\theta)$ for small enough positive $\theta$
and below $\ln M(\theta)$ for all $\theta < 0$. Therefore, the maximum value of $\theta a - \ln M(\theta)$ over $\theta \ge 0$ is equal
to $\ell(a)$, defined by

$$\ell(a) = \sup_{-\infty < \theta < \infty} \{\theta a - \ln M(\theta)\} \qquad (2.9)$$

Thus, the Chernoff bound in its optimized form is

$$P\left\{\frac{S_n}{n} \ge a\right\} \le \exp(-n\ell(a)), \qquad a > E[X_1].$$
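
The optimized Chernoff bound can be compared with an exact tail probability. The sketch below assumes Bernoulli($p$) summands, for which $\ln M(\theta) = \ln(pe^\theta + 1 - p)$ and the exact tail is a binomial sum; the supremum over $\theta \ge 0$ is approximated by a plain grid search. The function names, the grid parameters, and the values $n = 100$, $p = 0.5$, $a = 0.6$ are illustrative assumptions.

```python
import math

def log_mgf_bernoulli(theta: float, p: float) -> float:
    """ln M(theta) for a Bernoulli(p) random variable."""
    return math.log(p * math.exp(theta) + 1 - p)

def chernoff_exponent(a: float, p: float, theta_max: float = 20.0, grid: int = 20_000) -> float:
    """Grid approximation of the maximum over 0 <= theta <= theta_max of theta*a - ln M(theta)."""
    best = 0.0  # value at theta = 0
    for k in range(1, grid + 1):
        theta = theta_max * k / grid
        best = max(best, theta * a - log_mgf_bernoulli(theta, p))
    return best

def binomial_tail(n: int, p: float, k_min: int) -> float:
    """Exact P{S_n >= k_min} for S_n ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

n, p, k_min = 100, 0.5, 60
a = k_min / n
bound = math.exp(-n * chernoff_exponent(a, p))
exact = binomial_tail(n, p, k_min)
print(f"exact tail = {exact:.5f}, Chernoff bound = {bound:.5f}")
```

Because the grid search can only underestimate the supremum, the computed bound is never smaller than the true optimized bound, so it remains a valid upper bound on the exact tail.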



There does not exist such a clean lower bound on the large deviation probability $P\{\frac{S_n}{n} \ge a\}$,
but by the celebrated theorem of Cramér stated next, the Chernoff bound gives the right exponent.

Theorem 2.5.1 (Cramér's theorem) Suppose $E[X_1]$ is finite, and that $E[X_1] < a$. Then for $\epsilon > 0$
there exists a number $n_\epsilon$ such that

$$P\left\{\frac{S_n}{n} \ge a\right\} \ge \exp(-n(\ell(a) + \epsilon)) \qquad (2.10)$$

for all $n \ge n_\epsilon$. Combining this bound with the Chernoff inequality yields

$$\lim_{n \to \infty} \frac{1}{n} \ln P\left\{\frac{S_n}{n} \ge a\right\} = -\ell(a)$$

In particular, if $\ell(a)$ is finite (equivalently if $P\{X_1 \ge a\} > 0$) then

$$P\left\{\frac{S_n}{n} \ge a\right\} = \exp(-n(\ell(a) + \epsilon_n))$$

where $(\epsilon_n)$ is a sequence with $\epsilon_n \ge 0$ and $\lim_{n \to \infty} \epsilon_n = 0$.
Similarly, if $a < E[X_1]$ and $\ell(a)$ is finite, then

$$P\left\{\frac{S_n}{n} \le a\right\} = \exp(-n(\ell(a) + \epsilon_n))$$

where $(\epsilon_n)$ is a sequence with $\epsilon_n \ge 0$ and $\lim_{n \to \infty} \epsilon_n = 0$. Informally, we can write for $n$ large:

$$P\left\{\frac{S_n}{n} \in da\right\} \approx e^{-n\ell(a)}\,da \qquad (2.11)$$

Proof. The lower bound (2.10) is proved here under the additional assumption that $X_1$ is a
bounded random variable: $P\{|X_1| \le C\} = 1$ for some constant $C$; this assumption can be removed
by a truncation argument covered in a homework problem. Also, to avoid trivialities, suppose
$P\{X_1 > a\} > 0$. The assumption that $X_1$ is bounded and the monotone convergence theorem
imply that the function $M(\theta)$ is finite and infinitely differentiable over $\theta \in \mathbb{R}$. Given $\theta \in \mathbb{R}$, let
$P_\theta$ denote a new probability measure on the same probability space that $X_1, X_2, \ldots$ are defined on
such that for any $n$ and any event of the form $\{(X_1, \ldots, X_n) \in B\}$,

$$P_\theta\{(X_1, \ldots, X_n) \in B\} = \frac{E\left[I_{\{(X_1, \ldots, X_n) \in B\}}\, e^{\theta S_n}\right]}{M(\theta)^n}$$

In particular, if $X_i$ has pdf $f$ for each $i$ under the original probability measure $P$, then under
the new probability measure $P_\theta$, each $X_i$ has pdf $f_\theta$ defined by $f_\theta(x) = \frac{f(x)e^{\theta x}}{M(\theta)}$, and the random
variables $X_1, X_2, \ldots$ are independent under $P_\theta$. The pdf $f_\theta$ is called the tilted version of $f$ with
parameter $\theta$, and $P_\theta$ is similarly called the tilted version of $P$ with parameter $\theta$. It is not difficult
to show that the mean and variance of the $X_i$'s under $P_\theta$ are given by:

$$E_\theta[X_1] = \frac{E[X_1 e^{\theta X_1}]}{M(\theta)} = (\ln M(\theta))'$$

$$\mathrm{Var}_\theta[X_1] = E_\theta[X_1^2] - E_\theta[X_1]^2 = (\ln M(\theta))''$$

Under the assumptions we've made, $X_1$ has strictly positive variance under $P_\theta$ for all $\theta$, so that
$\ln M(\theta)$ is strictly convex.

The assumption $P\{X_1 > a\} > 0$ implies that $(a\theta - \ln M(\theta)) \to -\infty$ as $\theta \to \infty$. Together with
the fact that $\ln M(\theta)$ is differentiable and strictly convex, there thus exists a unique value $\theta^*$ of
$\theta$ that maximizes $a\theta - \ln M(\theta)$. So $\ell(a) = a\theta^* - \ln M(\theta^*)$. Also, the derivative of $a\theta - \ln M(\theta)$ at
$\theta = \theta^*$ is zero, so that $E_{\theta^*}[X_1] = (\ln M(\theta))'\big|_{\theta = \theta^*} = a$. Observe that for any $b$ with $b > a$,



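$$\begin{aligned}
P\left\{\frac{S_n}{n} \ge a\right\} &= \int_{\{\omega : na \le S_n\}} 1 \, dP \\
&= \int_{\{\omega : na \le S_n\}} M(\theta^*)^n e^{-\theta^* S_n} \, \frac{e^{\theta^* S_n}}{M(\theta^*)^n} \, dP \\
&= M(\theta^*)^n \int_{\{\omega : na \le S_n\}} e^{-\theta^* S_n} \, dP_{\theta^*} \\
&\ge M(\theta^*)^n \int_{\{\omega : na \le S_n \le nb\}} e^{-\theta^* S_n} \, dP_{\theta^*} \\
&\ge M(\theta^*)^n e^{-\theta^* nb} P_{\theta^*}\{na \le S_n \le nb\}
\end{aligned}$$

Now $M(\theta^*)^n e^{-\theta^* nb} = \exp(-n(\ell(a) + \theta^*(b - a)))$, and by the central limit theorem, $P_{\theta^*}\{na \le S_n \le
nb\} \to \frac{1}{2}$ as $n \to \infty$, so $P_{\theta^*}\{na \le S_n \le nb\} \ge 1/3$ for $n$ large enough. Therefore, for $n$ large enough,

$$P\left\{\frac{S_n}{n} \ge a\right\} \ge \exp\left(-n\left(\ell(a) + \theta^*(b - a) + \frac{\ln 3}{n}\right)\right)$$

Taking $b$ close enough to $a$ implies (2.10) for large enough $n$. $\Box$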
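
The tilting construction used in the proof can be illustrated with a discrete analogue: for a probability mass function $p$, the tilted pmf is $p_\theta(x) = p(x)e^{\theta x}/M(\theta)$. The sketch below assumes a hypothetical three-point distribution and checks, by finite differences, that the tilted mean and variance match the first and second derivatives of $\ln M(\theta)$. All names and numbers are illustrative.

```python
import math

# Hypothetical bounded discrete distribution: P{X = values[i]} = probs[i].
values = [-1.0, 0.0, 2.0]
probs = [0.3, 0.5, 0.2]

def log_mgf(theta: float) -> float:
    """ln M(theta) = ln E[exp(theta * X)]."""
    return math.log(sum(p * math.exp(theta * x) for x, p in zip(values, probs)))

def tilted_pmf(theta: float) -> list:
    """Tilted probabilities p(x) * exp(theta * x) / M(theta)."""
    m = math.exp(log_mgf(theta))
    return [p * math.exp(theta * x) / m for x, p in zip(values, probs)]

def tilted_mean_var(theta: float) -> tuple:
    """Mean and variance of X under the tilted distribution."""
    q = tilted_pmf(theta)
    mean = sum(qi * x for x, qi in zip(values, q))
    var = sum(qi * (x - mean) ** 2 for x, qi in zip(values, q))
    return mean, var

theta, h = 0.7, 1e-4
d1 = (log_mgf(theta + h) - log_mgf(theta - h)) / (2 * h)                   # approx (ln M)'
d2 = (log_mgf(theta + h) - 2 * log_mgf(theta) + log_mgf(theta - h)) / h**2  # approx (ln M)''
print(tilted_mean_var(theta), (d1, d2))
```

Increasing $\theta$ shifts probability toward larger values, which is how the proof arranges for the tilted mean to equal $a$.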



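Example 2.5.2 Let $X_1, X_2, \ldots$ be independent and exponentially distributed with parameter $\lambda = 1$. Then

$$\ln M(\theta) = \ln \int_0^\infty e^{\theta x} e^{-x}\,dx = \begin{cases} -\ln(1 - \theta) & \theta < 1 \\ +\infty & \theta \ge 1 \end{cases}$$

See Figure 2.14.

Figure 2.14: $\ln M(\theta)$ and $\ell(a)$ for an Exp(1) random variable.

Therefore, for any $a \in \mathbb{R}$,

$$\begin{aligned}
\ell(a) &= \max_\theta \{a\theta - \ln M(\theta)\} \\
&= \max_{\theta < 1} \{a\theta + \ln(1 - \theta)\}
\end{aligned}$$

If $a \le 0$ then $\ell(a) = +\infty$. On the other hand, if $a > 0$ then setting the derivative of $a\theta + \ln(1 - \theta)$
to 0 yields the maximizing value $\theta = 1 - \frac{1}{a}$, and therefore

$$\ell(a) = \begin{cases} a - 1 - \ln(a) & a > 0 \\ +\infty & a \le 0 \end{cases}$$

The function $\ell$ is shown in Figure 2.14.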
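
The closed form in Example 2.5.2 can be cross-checked against a direct numerical maximization. The sketch below assumes a simple grid search over $\theta \in [-50, 1)$; the function names, the grid range, and the test values of $a$ are illustrative choices, not part of the example.

```python
import math

def rate_exp1(a: float) -> float:
    """Closed form from Example 2.5.2: a - 1 - ln(a) for a > 0, else +infinity."""
    return a - 1 - math.log(a) if a > 0 else math.inf

def rate_exp1_grid(a: float, theta_min: float = -50.0, grid: int = 100_000) -> float:
    """Grid search for the maximum of a*theta + ln(1 - theta) over theta_min <= theta < 1."""
    best = -math.inf
    for k in range(grid):
        theta = theta_min + (1.0 - theta_min) * k / grid  # stays strictly below 1
        best = max(best, a * theta + math.log(1.0 - theta))
    return best

for a in (0.5, 1.0, 2.0, 4.0):
    print(a, rate_exp1(a), rate_exp1_grid(a))
```

Note that $\ell(1) = 0$: deviations of the sample mean to its own expectation $E[X_1] = 1$ carry no exponential cost.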



Example 2.5.3 Let $X_1, X_2, \ldots$ be independent Bernoulli random variables with parameter $p$ satisfying
$0 < p < 1$. Thus $S_n$ has the binomial distribution. Then $\ln M(\theta) = \ln(pe^\theta + (1 - p))$, which
has asymptotic slope 1 as $\theta \to +\infty$ and converges to a constant as $\theta \to -\infty$. Therefore, $\ell(a) = +\infty$
if $a > 1$ or if $a < 0$. For $0 \le a \le 1$, we find $a\theta - \ln M(\theta)$ is maximized by $\theta = \ln\left(\frac{a(1-p)}{p(1-a)}\right)$, leading to

$$\ell(a) = \begin{cases} a \ln\left(\frac{a}{p}\right) + (1 - a)\ln\left(\frac{1-a}{1-p}\right) & 0 \le a \le 1 \\ +\infty & \text{else} \end{cases}$$

See Figure 2.15.

Figure 2.15: $\ln M(\theta)$ and $\ell(a)$ for a Bernoulli distribution.
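
Cramér's theorem says that $-\frac{1}{n}\ln P\{\frac{S_n}{n} \ge a\}$ approaches $\ell(a)$ from above. The sketch below checks this for the Bernoulli case, assuming the illustrative values $p = 0.3$ and $a = 0.5$, and computing the exact binomial tail in the log domain to avoid underflow. The function names are hypothetical.

```python
import math

def rate_bernoulli(a: float, p: float) -> float:
    """Closed form from Example 2.5.3, valid for 0 < a < 1 and 0 < p < 1."""
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

def log_binomial_tail(n: int, p: float, k_min: int) -> float:
    """ln P{S_n >= k_min} for S_n ~ Binomial(n, p), computed via log-sum-exp."""
    logs = [
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
        for k in range(k_min, n + 1)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(x - top) for x in logs))

p = 0.3
for n in (100, 1_000, 10_000):
    k_min = n // 2  # corresponds to a = 1/2
    exponent = -log_binomial_tail(n, p, k_min) / n
    print(n, exponent, rate_bernoulli(0.5, p))
```

The gap between the two columns plays the role of $\epsilon_n$ in the theorem; it is nonnegative by the Chernoff bound and shrinks as $n$ grows.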



2.6 Problems 



2.1 Limits and infinite sums for deterministic sequences 

(a) Using the definition of a limit, show that $\lim_{\theta \to 0} \theta(1 + \cos(\theta)) = 0$.

(b) Using the definition of a limit, show that $\lim_{\theta \to 0,\, \theta > 0} \frac{1 + \cos(\theta)}{\theta} = +\infty$.

(c) Determine whether the following sum is finite, and justify your answer: $\sum_{n=1}^{\infty} \frac{1 + \sqrt{n}}{1 + n^2}$.
2.2 The limit of the product is the product of the limits

Consider two (deterministic) sequences with finite limits: $\lim_{n \to \infty} x_n = x$ and $\lim_{n \to \infty} y_n = y$.

(a) Prove that the sequence $(y_n)$ is bounded.

(b) Prove that $\lim_{n \to \infty} x_n y_n = xy$. (Hint: Note that $x_n y_n - xy = (x_n - x)y_n + x(y_n - y)$ and use
part (a)).

2.3 The reciprocal of the limit is the limit of the reciprocal

Using the definition of convergence for deterministic sequences, prove that if $(x_n)$ is a sequence with
a nonzero finite limit $x_\infty$, then the sequence $(1/x_n)$ converges to $1/x_\infty$.

2.4 Limits of some deterministic series 

Determine which of the following series are convergent (i.e. have partial sums converging to a finite 
limit). Justify your answers. 



(a) $\sum_{n=0}^{\infty} \frac{3^n}{n!}$ \quad (b) $\sum_{n=1}^{\infty} \frac{(n + 2)\ln n}{(n + 5)^3}$ \quad (c) $\sum_{n=1}^{\infty} \frac{1}{(\ln(n + 1))^5}$

2.5 On convergence of deterministic sequences and functions 

(a) Let $x_n = \frac{8n^2 + n}{3n^2}$ for $n \ge 1$. Prove that $\lim_{n \to \infty} x_n = \frac{8}{3}$.

(b) Suppose $f_n$ is a function on some set $D$ for each $n \ge 1$, and suppose $f$ is also a function on
$D$. Then $f_n$ is defined to converge to $f$ uniformly if for any $\epsilon > 0$, there exists an $n_\epsilon$ such that
$|f_n(x) - f(x)| \le \epsilon$ for all $x \in D$ whenever $n \ge n_\epsilon$. A key point is that $n_\epsilon$ does not depend on $x$.
Show that the functions $f_n(x) = x^n$ on the semi-open interval $[0, 1)$ do not converge uniformly to
the zero function.

(c) The supremum of a function $f$ on $D$, written $\sup_D f$, is the least upper bound of $f$. Equivalently,
$\sup_D f$ satisfies $\sup_D f \ge f(x)$ for all $x \in D$, and given any $c < \sup_D f$, there is an $x \in D$ such
that $f(x) \ge c$. Show that $|\sup_D f - \sup_D g| \le \sup_D |f - g|$. Conclude that if $f_n$ converges to $f$
uniformly on $D$, then $\sup_D f_n$ converges to $\sup_D f$.

2.6 Convergence of sequences of random variables 

Let $\Theta$ be uniformly distributed on the interval $[0, 2\pi]$. In which of the four senses (a.s., m.s., p.,
d.) does each of the following two sequences converge? Identify the limits, if they exist, and justify
your answers.

(a) $(X_n : n \ge 1)$ defined by $X_n = \cos(n\Theta)$.

(b) $(Y_n : n \ge 1)$ defined by $Y_n = \left|1 - \frac{\Theta}{\pi}\right|^n$.

2.7 Convergence of random variables on (0,1]

Let $\Omega = (0, 1]$, let $\mathcal{F}$ be the Borel $\sigma$ algebra of subsets of $(0, 1]$, and let $P$ be the probability measure
on $\mathcal{F}$ such that $P([a, b]) = b - a$ for $0 < a \le b \le 1$. For the following two sequences of random
variables on $(\Omega, \mathcal{F}, P)$, find and sketch the distribution function of $X_n$ for typical $n$, and decide in
which sense(s) (if any) each of the two sequences converges.

(a) $X_n(\omega) = n\omega - \lfloor n\omega \rfloor$, where $\lfloor x \rfloor$ is the largest integer less than or equal to $x$.

(b) $X_n(\omega) = n^2\omega$ if $0 < \omega < 1/n$, and $X_n(\omega) = 0$ otherwise.




2.8 Convergence of random variables on (0,1], version 2 

Let $\Omega = (0, 1]$, let $\mathcal{F}$ be the Borel $\sigma$ algebra of subsets of $(0, 1]$, and let $P$ be the probability measure
on $\mathcal{F}$ such that $P([a, b]) = b - a$ for $0 < a \le b \le 1$. Determine in which of the four senses (a.s., p.,
m.s., d.), if any, each of the following three sequences of random variables converges. Justify your
answers.

to *n(uO = ^- 

(b) $X_n(\omega) = n\omega^n$.

(c) $X_n(\omega) = \omega\sin(2\pi n\omega)$. (Try at least for a heuristic justification.)

2.9 On the maximum of a random walk with negative drift 

Let $X_1, X_2, \ldots$ be independent, identically distributed random variables with mean $E[X_i] = -1$.
Let $S_0 = 0$, and for $n \ge 1$, let $S_n = X_1 + \cdots + X_n$. Let $Z = \max\{S_n : n \ge 0\}$.

(a) Show that $Z$ is well defined with probability one, and $P\{Z < +\infty\} = 1$.

(b) Does there exist a finite constant $L$, depending only on the above assumptions, such that
$E[Z] \le L$? Justify your answer. (Hint: $Z \ge \max\{S_0, S_1\} = \max\{0, X_1\}$.)

2.10 Convergence of a sequence of discrete random variables 

Let $X_n = X + (1/n)$ where $P\{X = i\} = 1/6$ for $i = 1, 2, 3, 4, 5$ or 6, and let $F_n$ denote the
distribution function of $X_n$.

(a) For what values of $x$ does $F_n(x)$ converge to $F_X(x)$ as $n$ tends to infinity?

(b) At what values of $x$ is $F_X(x)$ continuous?

(c) Does the sequence $(X_n)$ converge in distribution to $X$?

2.11 Convergence in distribution to a nonrandom limit 

Let $(X_n, n \ge 1)$ be a sequence of random variables and let $X$ be a random variable such that
$P\{X = c\} = 1$ for some constant $c$. Prove that if $\lim_{n \to \infty} X_n = X$ d., then $\lim_{n \to \infty} X_n = X$ p.
That is, prove that convergence in distribution to a constant implies convergence in probability to
the same constant.

2.12 Convergence of a minimum 

Let $U_1, U_2, \ldots$ be a sequence of independent random variables, with each variable being uniformly
distributed over the interval $[0, 1]$, and let $X_n = \min\{U_1, \ldots, U_n\}$ for $n \ge 1$.

(a) Determine in which of the senses (a.s., m.s., p., d.) the sequence $(X_n)$ converges as $n \to \infty$,
and identify the limit, if any. Justify your answers.

(b) Determine the value of the constant $\theta$ so that the sequence $(Y_n)$ defined by $Y_n = n^\theta X_n$ converges
in distribution as $n \to \infty$ to a nonzero limit, and identify the limit distribution.

2.13 Convergence of a product 

Let $U_1, U_2, \ldots$ be a sequence of independent random variables, with each variable being uniformly
distributed over the interval $[0, 2]$, and let $X_n = U_1 U_2 \cdots U_n$ for $n \ge 1$.

(a) Determine in which of the senses (a.s., m.s., p., d.) the sequence $(X_n)$ converges as $n \to \infty$,
and identify the limit, if any. Justify your answers.

(b) Determine the value of the constant $\theta$ so that the sequence $(Y_n)$ defined by $Y_n = n^\theta \ln(X_n)$
converges in distribution as $n \to \infty$ to a nonzero limit.
2.14 Limits of functions of random variables
Let $g$ and $h$ be functions defined as follows:

$$g(x) = \begin{cases} -1 & \text{if } x < -1 \\ x & \text{if } -1 \le x \le 1 \\ 1 & \text{if } x > 1 \end{cases} \qquad h(x) = \begin{cases} -1 & \text{if } x \le 0 \\ 1 & \text{if } x > 0. \end{cases}$$

Thus, $g$ represents a clipper and $h$ represents a hard limiter. Suppose that $(X_n : n \ge 0)$ is a
sequence of random variables, and that $X$ is also a random variable, all on the same underlying
probability space. Give a yes or no answer to each of the four questions below. For each yes answer,
identify the limit and give a justification. For each no answer, give a counterexample.

(a) If $\lim_{n \to \infty} X_n = X$ a.s., then does $\lim_{n \to \infty} g(X_n)$ a.s. necessarily exist?

(b) If $\lim_{n \to \infty} X_n = X$ m.s., then does $\lim_{n \to \infty} g(X_n)$ m.s. necessarily exist?

(c) If $\lim_{n \to \infty} X_n = X$ a.s., then does $\lim_{n \to \infty} h(X_n)$ a.s. necessarily exist?

(d) If $\lim_{n \to \infty} X_n = X$ m.s., then does $\lim_{n \to \infty} h(X_n)$ m.s. necessarily exist?

2.15 Sums of i.i.d. random variables, I 

A gambler repeatedly plays the following game: She bets one dollar and then there are three 
possible outcomes: she wins two dollars back with probability 0.4, she gets just the one dollar back 
with probability 0.1, and otherwise she gets nothing back. Roughly what is the probability that 
she is ahead after playing the game one hundred times? 

2.16 Sums of i.i.d. random variables, II 

Let $X_1, X_2, \ldots$ be independent random variables with $P\{X_i = 1\} = P\{X_i = -1\} = 0.5$.

(a) Compute the characteristic function of the following random variables: $X_1$, $S_n = X_1 + \cdots + X_n$,
and $V_n = S_n/\sqrt{n}$.

(b) Find the pointwise limits of the characteristic functions of $S_n$ and $V_n$ as $n \to \infty$.

(c) In what sense(s), if any, do the sequences $(S_n)$ and $(V_n)$ converge?

2.17 Sums of i.i.d. random variables, III 

Fix $\lambda > 0$. For each integer $n > \lambda$, let $X_{1,n}, X_{2,n}, \ldots, X_{n,n}$ be independent random variables such
that $P\{X_{i,n} = 1\} = \lambda/n$ and $P\{X_{i,n} = 0\} = 1 - (\lambda/n)$. Let $Y_n = X_{1,n} + X_{2,n} + \cdots + X_{n,n}$.

(a) Compute the characteristic function of $Y_n$ for each $n$.

(b) Find the pointwise limit of the characteristic functions as $n \to \infty$. The limit is the
characteristic function of what probability distribution?

(c) In what sense(s), if any, does the sequence $(Y_n)$ converge?

2.18 On the growth of the maximum of n independent exponentials 

Suppose that $X_1, X_2, \ldots$ are independent random variables, each with the exponential distribution
with parameter $\lambda = 1$. For $n \ge 2$, let $Z_n = \frac{\max\{X_1, \ldots, X_n\}}{\ln(n)}$.

(a) Find a simple expression for the CDF of $Z_n$.

(b) Show that $(Z_n)$ converges in distribution to a constant, and find the constant. (Note: It follows
immediately that $Z_n$ converges in p. to the same constant. It can also be shown that $(Z_n)$ converges
in the a.s. and m.s. senses to the same constant.)
2.19 Normal approximation for quantization error

Suppose each of 100 real numbers is rounded to the nearest integer and then added. Assume the
individual roundoff errors are independent and uniformly distributed over the interval $[-0.5, 0.5]$.
Using the normal approximation suggested by the central limit theorem, find the approximate 
probability that the absolute value of the sum of the errors is greater than 5. 

2.20 Limit behavior of a stochastic dynamical system 

Let $W_1, W_2, \ldots$ be a sequence of independent, $N(0, 0.5)$ random variables. Let $X_0 = 0$, and define
$X_1, X_2, \ldots$ recursively by $X_{k+1} = X_k^2 + W_k$. Determine in which of the senses (a.s., m.s., p., d.)
the sequence $(X_n)$ converges as $n \to \infty$, and identify the limit, if any. Justify your answer.

2.21 Applications of Jensen's inequality 

Explain how each of the inequalities below follows from Jensen's inequality. Specifically, identify
the convex function and random variable used.

(a) $E\left[\frac{1}{X}\right] \ge \frac{1}{E[X]}$, for a positive random variable $X$ with finite mean.

(b) $E[X^4] \ge E[X^2]^2$, for a random variable $X$ with finite second moment.

(c) $D(f|g) \ge 0$, where $f$ and $g$ are positive probability densities on a set $A$, and $D$ is the divergence
distance defined by $D(f|g) = \int_A f(x) \ln\frac{f(x)}{g(x)}\,dx$. (The base used in the logarithm is not relevant.)

2.22 Convergence analysis of successive averaging 

Let $U_1, U_2, \ldots$ be independent random variables, each uniformly distributed on the interval $[0, 1]$.
Let $X_0 = 0$ and $X_1 = 1$, and for $n \ge 1$ let $X_{n+1} = (1 - U_n)X_n + U_n X_{n-1}$. Note that given $X_{n-1}$
and $X_n$, the variable $X_{n+1}$ is uniformly distributed on the interval with endpoints $X_{n-1}$ and $X_n$.

(a) Sketch a typical sample realization of the first few variables in the sequence.

(b) Find $E[X_n]$ for all $n$.

(c) Show that $X_n$ converges in the a.s. sense as $n$ goes to infinity. Explain your reasoning. (Hint:
Let $D_n = |X_n - X_{n-1}|$. Then $D_{n+1} = U_n D_n$, and if $m > n$ then $|X_m - X_n| \le D_n$.)

2.23 Understanding the Markov inequality 

Suppose $X$ is a random variable with $E[X^4] = 30$.

(a) Derive an upper bound on $P\{|X| \ge 10\}$. Show your work.

(b) (Your bound in (a) must be the best possible in order to get both parts (a) and (b) correct). 
Find a distribution for X such that the bound you found in part (a) holds with equality. 

2.24 Mean square convergence of a random series 

The sum of infinitely many random variables, $X_1 + X_2 + \cdots$, is defined as the limit as $n$ tends to
infinity of the partial sums $X_1 + X_2 + \cdots + X_n$. The limit can be taken in the usual senses (in
probability, in distribution, etc.). Suppose that the $X_i$ are mutually independent with mean zero.
Show that $X_1 + X_2 + \cdots$ exists in the mean square sense if and only if the sum of the variances,
$\mathrm{Var}(X_1) + \mathrm{Var}(X_2) + \cdots$, is finite. (Hint: Apply the Cauchy criterion for mean square convergence.)

2.25 Portfolio allocation 

Suppose that you are given one unit of money (for example, a million dollars). Each day you bet
a fraction $\alpha$ of it on a coin toss. If you win, you get double your money back, whereas if you lose,
you get half of your money back. Let $W_n$ denote the wealth you have accumulated (or have left)
after $n$ days. Identify in what sense(s) the limit $\lim_{n \to \infty} W_n$ exists, and when it does, identify the
value of the limit

(a) for $\alpha = 0$ (pure banking),

(b) for $\alpha = 1$ (pure betting),

(c) for general $\alpha$.

(d) What value of $\alpha$ maximizes the expected wealth, $E[W_n]$? Would you recommend using that
value of $\alpha$?

(e) What value of $\alpha$ maximizes the long term growth rate of $W_n$? (Hint: Consider $\ln(W_n)$ and apply
the LLN.)

2.26 A large deviation 

Let $X_1, X_2, \ldots$ be independent, $N(0, 1)$ random variables. Find the constant $b$ such that

$$P\{X_1^2 + X_2^2 + \cdots + X_n^2 \ge 2n\} = \exp(-n(b + \epsilon_n))$$

where $\epsilon_n \to 0$ as $n \to \infty$. What is the numerical value of the approximation $\exp(-nb)$ if $n = 100$?

2.27 Sums of independent Cauchy random variables 

Let $X_1, X_2, \ldots$ be independent, each with the standard Cauchy density function. The standard
Cauchy density and its characteristic function are given by $f(x) = \frac{1}{\pi(1 + x^2)}$ and $\Phi(u) = \exp(-|u|)$.
Let $S_n = X_1 + X_2 + \cdots + X_n$.

(a) Find the characteristic function of $\frac{S_n}{n^\theta}$ for a constant $\theta$.

(b) Does $\frac{S_n}{n}$ converge in distribution as $n \to \infty$? Justify your answer, and if the answer is yes,
identify the limiting distribution.

(c) Does $\frac{S_n}{n^2}$ converge in distribution as $n \to \infty$? Justify your answer, and if the answer is yes,
identify the limiting distribution.

(d) Does $\frac{S_n}{\sqrt{n}}$ converge in distribution as $n \to \infty$? Justify your answer, and if the answer is yes,
identify the limiting distribution.

2.28 A rapprochement between the central limit theorem and large deviations 

Let $X_1, X_2, \ldots$ be independent, identically distributed random variables with mean zero, variance
$\sigma^2$, and probability density function $f$. Suppose the moment generating function $M(\theta)$ is finite for
$\theta$ in an open interval $I$ containing zero.

(a) Show that for $\theta \in I$, $(\ln M(\theta))''$ is the variance for the "tilted" density function $f_\theta$ defined by
$f_\theta(x) = f(x)\exp(\theta x - \ln M(\theta))$. In particular, since $(\ln M(\theta))''$ is nonnegative, $\ln M$ is a convex
function. (The interchange of expectation and differentiation with respect to $\theta$ can be justified for
$\theta \in I$. You needn't give details.)

Let $b > 0$ and let $S_n = X_1 + \cdots + X_n$ for $n$ any positive integer. By the central limit theorem,
$P\{S_n \ge b\sqrt{n}\} \to Q(b/\sigma)$ as $n \to \infty$. An upper bound on the $Q$ function is given by

$$Q(u) = \int_u^\infty \frac{1}{\sqrt{2\pi}} e^{-s^2/2}\,ds \le \int_u^\infty \frac{s}{u\sqrt{2\pi}} e^{-s^2/2}\,ds = \frac{1}{u\sqrt{2\pi}} e^{-u^2/2}.$$

This bound is a good approximation if $u$ is
moderately large. Thus, $Q(b/\sigma) \approx \frac{\sigma}{b\sqrt{2\pi}} e^{-b^2/2\sigma^2}$ if $b/\sigma$ is moderately large.

(b) The large deviations upper bound yields $P\{S_n \ge b\sqrt{n}\} \le \exp(-n\ell(b/\sqrt{n}))$. Identify the
limit of the large deviations upper bound as $n \to \infty$, and compare with the approximation given
by the central limit theorem. (Hint: Approximate $\ln M$ near zero by its second order Taylor's
approximation.)

2.29 Chernoff bound for Gaussian and Poisson random variables

(a) Let $X$ have the $N(\mu, \sigma^2)$ distribution. Find the optimized Chernoff bound on $P\{X \ge E[X] + c\}$
for $c \ge 0$.

(b) Let $Y$ have the $Poi(\lambda)$ distribution. Find the optimized Chernoff bound on $P\{Y \ge E[Y] + c\}$
for $c \ge 0$.

(c) (The purpose of this problem is to highlight the similarity of the answers to parts (a) and (b).)
Show that your answer to part (b) can be expressed as $P\{Y \ge E[Y] + c\} \le \exp\left(-\frac{c^2}{2\lambda}\psi\left(\frac{c}{\lambda}\right)\right)$ for $c \ge 0$,
where $\psi(u) = 2g(1 + u)/u^2$, with $g(s) = s(\ln s - 1) + 1$. (Note: $Y$ has variance $\lambda$, so the essential
difference between the normal and Poisson bounds is the $\psi$ term. The function $\psi$ is strictly positive
and strictly decreasing on the interval $[-1, +\infty)$, with $\psi(-1) = 2$ and $\psi(0) = 1$. Also, $u\psi(u)$ is
strictly increasing in $u$ over the interval $[-1, +\infty)$.)

2.30 Large deviations of a mixed sum 

Let $X_1, X_2, \ldots$ have the $Exp(1)$ distribution, and $Y_1, Y_2, \ldots$ have the $Poi(1)$ distribution. Suppose
all these random variables are mutually independent. Let $0 \le f \le 1$, and suppose $S_n = X_1 + \cdots +
X_{nf} + Y_1 + \cdots + Y_{(1-f)n}$. Define $\ell(f, a) = \lim_{n \to \infty} -\frac{1}{n} \ln P\{\frac{S_n}{n} \ge a\}$ for $a > 1$. Cramér's theorem can
be extended to show that $\ell(f, a)$ can be computed by replacing the probability $P\{\frac{S_n}{n} \ge a\}$ by its
optimized Chernoff bound. (For example, if $f = 1/2$, we simply view $S_n$ as the sum of the $\frac{n}{2}$ i.i.d.
random variables, $X_1 + Y_1, \ldots, X_{n/2} + Y_{n/2}$.) Compute $\ell(f, a)$ for $f \in \{0, \frac{1}{3}, \frac{2}{3}, 1\}$ and $a = 4$.

2.31 Large deviation exponent for a mixture distribution 

Problem 2.30 concerns an example such that $0 < f < 1$ and $S_n$ is the sum of $n$ independent random
variables, such that a fraction $f$ of the random variables have a CDF $F_Y$ and a fraction $1 - f$ have
a CDF $F_Z$. It is shown in the solutions that the large deviations exponent for $\frac{S_n}{n}$ is given by:

$$\ell(a) = \max_\theta \{\theta a - f M_Y(\theta) - (1 - f) M_Z(\theta)\}$$

where $M_Y(\theta)$ and $M_Z(\theta)$ are the log moment generating functions for $F_Y$ and $F_Z$ respectively.
Consider the following variation. Let $X_1, X_2, \ldots, X_n$ be independent, and identically distributed,
each with CDF given by $F_X(c) = fF_Y(c) + (1 - f)F_Z(c)$. Equivalently, each $X_i$ can be generated
by flipping a biased coin with probability of heads equal to $f$, and generating $X_i$ using CDF $F_Y$
if heads shows and generating $X_i$ with CDF $F_Z$ if tails shows. Let $\tilde{S}_n = X_1 + \cdots + X_n$, and let $\tilde{\ell}$
denote the large deviations exponent for $\frac{\tilde{S}_n}{n}$.

(a) Express the function $\tilde{\ell}$ in terms of $f$, $M_Y$, and $M_Z$.

(b) Determine which is true and give a proof: $\tilde{\ell}(a) \le \ell(a)$ for all $a$, or $\tilde{\ell}(a) \ge \ell(a)$ for all $a$. Can you
also offer an intuitive explanation?

2.32 The limit of a sum of cumulative products of a sequence of uniform random 
variables 
Let $A_1, A_2, \ldots$ be a sequence of independent random variables, with
$P(A_i = 1) = P(A_i = \frac{1}{2}) = \frac{1}{2}$ for all $i$. Let $B_k = A_1 \cdots A_k$.

(a) Does $\lim_{k \to \infty} B_k$ exist in the m.s. sense? Justify your answer.

(b) Does $\lim_{k \to \infty} B_k$ exist in the a.s. sense? Justify your answer.

(c) Let $S_n = B_1 + \cdots + B_n$. Show that $\lim_{m,n \to \infty} E[S_m S_n] = \frac{35}{3}$, which implies that $\lim_{n \to \infty} S_n$
exists in the m.s. sense.

(d) Find the mean and variance of the limit random variable.

(e) Does $\lim_{n \to \infty} S_n$ exist in the a.s. sense? Justify your answer.

2.33 * Distance measures (metrics) for random variables 

For random variables $X$ and $Y$, define

$$d_1(X, Y) = E[|X - Y|/(1 + |X - Y|)]$$

$$d_2(X, Y) = \min\{\epsilon \ge 0 : F_X(x + \epsilon) + \epsilon \ge F_Y(x) \text{ and } F_Y(x + \epsilon) + \epsilon \ge F_X(x) \text{ for all } x\}$$

$$d_3(X, Y) = (E[(X - Y)^2])^{1/2},$$

where in defining $d_3(X, Y)$ it is assumed that $E[X^2]$ and $E[Y^2]$ are finite.

(a) Show that $d_i$ is a metric for $i = 1, 2$ or 3. Clearly $d_i(X, X) = 0$ and $d_i(X, Y) = d_i(Y, X)$.
Verify in addition the triangle inequality. (The only other requirement of a metric is that
$d_i(X, Y) = 0$ only if $X = Y$. For this to be true we must think of the metric as being defined
on equivalence classes of random variables.)

(b) Let $X_1, X_2, \ldots$ be a sequence of random variables and let $Y$ be a random variable. Show that
$X_n$ converges to $Y$

(i) in probability if and only if $d_1(X_n, Y)$ converges to zero,

(ii) in distribution if and only if $d_2(X_n, Y)$ converges to zero,

(iii) in the mean square sense if and only if $d_3(X_n, Y)$ converges to zero (assume $E[Y^2] < \infty$).

(Hint for (i): It helps to establish that

$$d_1(X, Y) - \epsilon/(1 + \epsilon) \le P\{|X - Y| \ge \epsilon\} \le d_1(X, Y)(1 + \epsilon)/\epsilon.$$

The "only if" part of (ii) is a little tricky. The metric $d_2$ is called the Levy metric.)

2.34 * Weak Law of Large Numbers 

Let $X_1, X_2, \ldots$ be a sequence of random variables which are independent and identically distributed.
Assume that $E[X_i]$ exists and is equal to zero for all $i$. If $\mathrm{Var}(X_i)$ is finite, then Chebychev's
inequality easily establishes that $(X_1 + \cdots + X_n)/n$ converges in probability to zero. Taking that
result as a starting point, show that the convergence still holds even if $\mathrm{Var}(X_i)$ is infinite. (Hint:
Use "truncation" by defining $U_k = X_k I\{|X_k| \ge c\}$ and $V_k = X_k I\{|X_k| < c\}$ for some constant $c$.
$E[|U_k|]$ and $E[V_k]$ don't depend on $k$ and converge to zero as $c$ tends to infinity. You might also
find the previous problem helpful.)
2.35 * Completing the proof of Cramér's theorem

Prove Theorem 2.5.1 without the assumption that the random variables are bounded. To begin,
select a large constant $C$ and let $\tilde{X}_i$ denote a random variable with the conditional distribution of
$X_i$ given that $|X_i| \le C$. Let $\tilde{S}_n = \tilde{X}_1 + \cdots + \tilde{X}_n$ and let $\tilde{\ell}$ denote the large deviations exponent for
$\tilde{X}_i$. Then

$$P\left\{\frac{S_n}{n} \ge a\right\} \ge P\{|X_1| \le C\}^n \, P\left\{\frac{\tilde{S}_n}{n} \ge a\right\}$$

One step is to show that $\tilde{\ell}(a)$ converges to $\ell(a)$ as $C \to \infty$. It is equivalent to showing that if a
pointwise monotonically increasing sequence of convex functions converges pointwise to a nonnegative
convex function that is strictly positive outside some bounded set, then the minima of the
convex functions converge to a nonnegative value.



Chapter 3 

Random Vectors and Minimum Mean 
Squared Error Estimation 



The reader is encouraged to review the section on matrices in the appendix before reading this 
chapter. 



3.1 Basic definitions and properties 



A random vector $X$ of dimension $m$ has the form

$$X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{pmatrix}$$

where the $X_i$'s are random variables all on the same probability space. The expectation of $X$ (also
called the mean of $X$) is the vector $EX$ (or $E[X]$) defined by

$$EX = \begin{pmatrix} EX_1 \\ EX_2 \\ \vdots \\ EX_m \end{pmatrix}$$

Suppose $Y$ is another random vector on the same probability space as $X$, with dimension $n$. The
cross correlation matrix of $X$ and $Y$ is the $m \times n$ matrix $E[XY^T]$, which has $ij$th entry $E[X_i Y_j]$.
The cross covariance matrix of $X$ and $Y$, denoted by $\mathrm{Cov}(X, Y)$, is the matrix with $ij$th entry
$\mathrm{Cov}(X_i, Y_j)$. Note that the correlation matrix is the matrix of correlations, and the covariance
matrix is the matrix of covariances.

In the particular case that $n = m$ and $Y = X$, the cross correlation matrix of $X$ with itself is
simply called the correlation matrix of $X$, and is written as $E[XX^T]$, and it has $ij$th entry $E[X_i X_j]$.
The cross covariance matrix of $X$ with itself, $\mathrm{Cov}(X, X)$, has $ij$th entry $\mathrm{Cov}(X_i, X_j)$. This matrix
is called the covariance matrix of $X$, and it is also denoted by $\mathrm{Cov}(X)$. So the notations $\mathrm{Cov}(X)$
and $\mathrm{Cov}(X, X)$ are interchangeable. While the notation $\mathrm{Cov}(X)$ is more concise, the notation
$\mathrm{Cov}(X, X)$ is more suggestive of the way the covariance matrix scales when $X$ is multiplied by a
constant.

Elementary properties of expectation, correlation, and covariance for vectors follow immediately 
from similar properties for ordinary scalar random variables. These properties include the following 
(here A and C are nonrandom matrices and b and d are nonrandom vectors). 

1. $E[AX + b] = AE[X] + b$

2. $\mathrm{Cov}(X, Y) = E[X(Y - EY)^T] = E[(X - EX)Y^T] = E[XY^T] - (EX)(EY)^T$

3. $E[(AX)(CY)^T] = AE[XY^T]C^T$

4. $\mathrm{Cov}(AX + b, CY + d) = A\,\mathrm{Cov}(X, Y)\,C^T$

5. $\mathrm{Cov}(AX + b) = A\,\mathrm{Cov}(X)\,A^T$

6. $\mathrm{Cov}(W + X, Y + Z) = \mathrm{Cov}(W, Y) + \mathrm{Cov}(W, Z) + \mathrm{Cov}(X, Y) + \mathrm{Cov}(X, Z)$

The second property above shows the close connection between correlation matrices
and covariance matrices. In particular, if the mean vector of either $X$ or $Y$ is zero, then the cross
correlation and cross covariance matrices are equal.
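
Property 4 can be checked on data, because the empirical distribution of a finite sample is itself a probability distribution, so the identity holds for sample covariances up to round-off. The sketch below uses only the standard library; the helper names, the matrices $A$, $C$, the vectors $b$, $d$, and the way the samples are generated are all arbitrary illustrative choices.

```python
import random

def mat_vec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def mat_mul(A, B):
    """Matrix-matrix product for matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def cross_cov(xs, ys):
    """Cross covariance matrix of paired vector samples under the empirical distribution."""
    n = len(xs)
    mx = [sum(x[i] for x in xs) / n for i in range(len(xs[0]))]
    my = [sum(y[j] for y in ys) / n for j in range(len(ys[0]))]
    return [[sum((x[i] - mx[i]) * (y[j] - my[j]) for x, y in zip(xs, ys)) / n
             for j in range(len(my))] for i in range(len(mx))]

random.seed(0)
xs = [[random.gauss(0, 1), random.uniform(-1, 1)] for _ in range(500)]
ys = [[x[0] + random.gauss(0, 1), x[1] ** 2, random.random()] for x in xs]  # correlated with xs

A, b = [[1.0, 2.0], [0.0, -1.0]], [3.0, 4.0]
C, d = [[1.0, 0.0, 1.0], [2.0, -1.0, 0.5]], [-1.0, 0.5]

ax_b = [[v + bi for v, bi in zip(mat_vec(A, x), b)] for x in xs]
cy_d = [[v + di for v, di in zip(mat_vec(C, y), d)] for y in ys]

lhs = cross_cov(ax_b, cy_d)                                 # Cov(AX + b, CY + d)
rhs = mat_mul(mat_mul(A, cross_cov(xs, ys)), transpose(C))  # A Cov(X, Y) C^T
max_gap = max(abs(l - r) for lrow, rrow in zip(lhs, rhs) for l, r in zip(lrow, rrow))
print(max_gap)
```

The constant shifts $b$ and $d$ drop out because covariances are computed from centered quantities.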

Not every square matrix is a correlation matrix. For example, the diagonal elements must be
nonnegative. Also, Schwarz's inequality (see Section 1.10) must be respected, so that
$|\mathrm{Cov}(X_i, X_j)| \le \sqrt{\mathrm{Cov}(X_i, X_i)\mathrm{Cov}(X_j, X_j)}$. Additional inequalities arise for consideration of three or more random
variables at a time. Of course a square diagonal matrix is a correlation matrix if and only if its
diagonal entries are nonnegative, because only vectors with independent entries need be considered.
But if an $m \times m$ matrix is not diagonal, it is not a priori clear whether there are $m$ random variables
with all $m(m + 1)/2$ correlations matching the entries of the matrix. The following proposition
neatly resolves these issues.

Proposition 3.1.1 Correlation matrices and covariance matrices are positive semidefinite. Con- 
versely, if K is a positive semidefinite matrix, then K is the covariance matrix and correlation 
matrix for some mean zero random vector X. 

Proof. If K is a correlation matrix, then $K = E[XX^T]$ for some random vector X. Given any
vector a, $a^T X$ is a scalar random variable, so

$$a^T K a = E[a^T X X^T a] = E[(a^T X)(X^T a)] = E[(a^T X)^2] \ge 0.$$

Similarly, if $K = \mathrm{Cov}(X, X)$ then for any vector a,

$$a^T K a = a^T \mathrm{Cov}(X,X)\, a = \mathrm{Cov}(a^T X, a^T X) = \mathrm{Var}(a^T X) \ge 0.$$

The first part of the proposition is proved. 

For the converse part, suppose that K is an arbitrary symmetric positive semidefinite matrix.
Let $\lambda_1, \ldots, \lambda_m$ and U be the corresponding set of eigenvalues and orthonormal matrix formed by
the eigenvectors. (See Section 11.7 in the appendix.) Let $Y_1, \ldots, Y_m$ be independent, mean zero
random variables with $\mathrm{Var}(Y_i) = \lambda_i$, and let Y be the random vector $Y = (Y_1, \ldots, Y_m)^T$. Then
$\mathrm{Cov}(Y, Y) = \Lambda$, where $\Lambda$ is the diagonal matrix with the $\lambda_i$'s on the diagonal. Let $X = UY$. Then
$EX = 0$ and

$$\mathrm{Cov}(X,X) = \mathrm{Cov}(UY,UY) = U \Lambda U^T = K.$$

Therefore, K is both the covariance matrix and the correlation matrix of X. ■
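
As a small worked illustration of the first part of the proposition (the matrix below is a made-up example, not one taken from the notes), a symmetric matrix with nonnegative diagonal can still fail to be a correlation matrix. Testing the quadratic form with a single vector $a$ is enough to rule it out:

```latex
K = \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix}, \qquad
a = \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \qquad
a^T K a = 1 - 2 - 2 + 1 = -2 < 0 .
```

So this K is not positive semidefinite and hence is neither a correlation matrix nor a covariance matrix. This agrees with Schwarz's inequality, which would require $|2| \le \sqrt{1 \cdot 1} = 1$.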

The characteristic function $\Phi_X$ of X is the function on $\mathbb{R}^m$ defined by

$$\Phi_X(u) = E[\exp(j u^T X)].$$

3.2 The orthogonality principle for minimum mean square error 
estimation 

Let X be a random variable with some known distribution. Suppose X is not observed but that 
we wish to estimate X. If we use a constant b to estimate X, the estimation error will be $X - b$.
The mean square error (MSE) is $E[(X - b)^2]$. Since $E[X - EX] = 0$ and $EX - b$ is constant,

$$\begin{aligned}
E[(X - b)^2] &= E[((X - EX) + (EX - b))^2] \\
&= E[(X - EX)^2 + 2(X - EX)(EX - b) + (EX - b)^2] \\
&= \mathrm{Var}(X) + (EX - b)^2 .
\end{aligned}$$

From this expression it is easy to see that the mean square error is minimized with respect to b if
and only if $b = EX$. The minimum possible value is $\mathrm{Var}(X)$.
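
The decomposition of the MSE into variance plus squared bias can be confirmed exactly on a finite distribution. The following is a minimal sketch; the pmf and the helper names `expect` and `mse_constant` are assumptions made for illustration.

```python
from fractions import Fraction

# Hypothetical pmf of X: value -> probability (illustrative numbers only).
pmf = {0: Fraction(1, 2), 1: Fraction(1, 4), 4: Fraction(1, 4)}

def expect(h, pmf):
    """E[h(X)] for a finite pmf."""
    return sum(p * h(x) for x, p in pmf.items())

def mse_constant(pmf, b):
    """Mean square error E[(X - b)^2] of the constant estimator b."""
    return expect(lambda x: (x - b) ** 2, pmf)

mean = expect(lambda x: x, pmf)
var = expect(lambda x: (x - mean) ** 2, pmf)

for b in [Fraction(0), Fraction(1), mean, Fraction(3)]:
    # Decomposition from the text: MSE = Var(X) + (EX - b)^2.
    assert mse_constant(pmf, b) == var + (mean - b) ** 2
```

Because the second term is a square that vanishes only at b = EX, any other constant gives a strictly larger MSE.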

Random variables X and Y are called orthogonal if $E[XY] = 0$. Orthogonality is denoted by
"$X \perp Y$."

The essential fact $E[X - EX] = 0$ is equivalent to the following condition: $X - EX$ is orthogonal
to constants: $(X - EX) \perp c$ for any constant c. Therefore, the choice of constant b yielding the
minimum mean square error is the one that makes the error $X - b$ orthogonal to all constants. This
result is generalized by the orthogonality principle, stated next.

Fix some probability space and let $L^2(\Omega, \mathcal{F}, P)$ be the set of all random variables on the
probability space with finite second moments. Let X be a random variable in $L^2(\Omega, \mathcal{F}, P)$, and let $\mathcal{V}$ be
a collection of random variables on the same probability space as X such that

V.1 $\mathcal{V} \subset L^2(\Omega, \mathcal{F}, P)$

V.2 $\mathcal{V}$ is a linear class: If $Z_1 \in \mathcal{V}$ and $Z_2 \in \mathcal{V}$ and $a_1, a_2$ are constants, then $a_1 Z_1 + a_2 Z_2 \in \mathcal{V}$

V.3 $\mathcal{V}$ is closed in the mean square sense: If $Z_1, Z_2, \ldots$ is a sequence of elements of $\mathcal{V}$ and if
$Z_n \to Z_\infty$ m.s. for some random variable $Z_\infty$, then $Z_\infty \in \mathcal{V}$.

That is, $\mathcal{V}$ is a closed linear subspace of $L^2(\Omega, \mathcal{F}, P)$. The problem of interest is to find $Z^*$ in $\mathcal{V}$ to
minimize the mean square error, $E[(X - Z)^2]$, over all $Z \in \mathcal{V}$. That is, $Z^*$ is the random variable in
$\mathcal{V}$ that is closest to X in the minimum mean square error (MMSE) sense. We call it the projection
of X onto $\mathcal{V}$ and denote it as $\Pi_{\mathcal{V}}(X)$.

Estimating a random variable by a constant corresponds to the case that V is the set of constant 
random variables: the projection of a random variable X onto the set of constant random variables 
is EX. The orthogonality principle stated next is illustrated in Figure 3.1. 




Figure 3.1: Illustration of the orthogonality principle. 



Theorem 3.2.1 (The orthogonality principle) Let $\mathcal{V}$ be a closed, linear subspace of $L^2(\Omega, \mathcal{F}, P)$,
and let $X \in L^2(\Omega, \mathcal{F}, P)$, for some probability space $(\Omega, \mathcal{F}, P)$.

(a) (Existence and uniqueness) There exists a unique element $Z^*$ (also denoted by $\Pi_{\mathcal{V}}(X)$) in $\mathcal{V}$
so that $E[(X - Z^*)^2] \le E[(X - Z)^2]$ for all $Z \in \mathcal{V}$. (Here, we consider two elements $Z$ and
$Z'$ of $\mathcal{V}$ to be the same if $P\{Z = Z'\} = 1$.)

(b) (Characterization) Let W be a random variable. Then $W = Z^*$ if and only if the following
two conditions hold:

(i) $W \in \mathcal{V}$

(ii) $(X - W) \perp Z$ for all Z in $\mathcal{V}$.

(c) (Error expression) The minimum mean square error (MMSE) is given by

$$E[(X - Z^*)^2] = E[X^2] - E[(Z^*)^2].$$

Proof. The proof of part (a) is given in an extra credit homework problem. The technical 
condition V.3 on V is essential for the proof of existence. Here parts (b) and (c) are proved. 

To establish the "if" half of part (b), suppose W satisfies (i) and (ii) and let Z be an arbitrary
element of $\mathcal{V}$. Then $W - Z \in \mathcal{V}$ because $\mathcal{V}$ is a linear class. Therefore, $(X - W) \perp (W - Z)$, which
implies that

$$\begin{aligned}
E[(X - Z)^2] &= E[(X - W + W - Z)^2] \\
&= E[(X - W)^2 + 2(X - W)(W - Z) + (W - Z)^2] \\
&= E[(X - W)^2] + E[(W - Z)^2].
\end{aligned}$$

Thus $E[(X - W)^2] \le E[(X - Z)^2]$. Since Z is an arbitrary element of $\mathcal{V}$, it follows that $W = Z^*$,
and the "if" half of (b) is proved.

To establish the "only if" half of part (b), note that $Z^* \in \mathcal{V}$ by the definition of $Z^*$. Let $Z \in \mathcal{V}$
and let $c \in \mathbb{R}$. Then $Z^* + cZ \in \mathcal{V}$, so that $E[(X - (Z^* + cZ))^2] \ge E[(X - Z^*)^2]$. But

$$E[(X - (Z^* + cZ))^2] = E[((X - Z^*) - cZ)^2] = E[(X - Z^*)^2] - 2cE[(X - Z^*)Z] + c^2 E[Z^2],$$

so that

$$-2cE[(X - Z^*)Z] + c^2 E[Z^2] \ge 0. \tag{3.1}$$

As a function of c the left side of (3.1) is a parabola with value zero at $c = 0$. Since it is
nonnegative for all c, it is minimized at $c = 0$. Hence its derivative with respect to c at 0 must be
zero, which yields that $(X - Z^*) \perp Z$. The "only if" half of (b) is proved.

The expression of part (c) is proved as follows. Since $X - Z^*$ is orthogonal to all elements of
$\mathcal{V}$, including $Z^*$ itself,

$$E[X^2] = E[((X - Z^*) + Z^*)^2] = E[(X - Z^*)^2] + E[(Z^*)^2].$$

This proves part (c). ■
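
When the subspace is the span of finitely many random variables on a finite probability space, the characterization in part (b) turns into a small linear system (the normal equations). The sketch below is one way to compute such a projection and check parts (b) and (c) exactly. The sample space, the random variables, and the helper names `inner` and `project` are all illustrative assumptions. A finite-dimensional span is automatically closed, so condition V.3 holds here.

```python
from fractions import Fraction

def expect(values, probs):
    """E[Z] for a random variable listed by its values on a finite sample space."""
    return sum(v * p for v, p in zip(values, probs))

def inner(u, v, probs):
    """E[UV], the inner product of two random variables."""
    return expect([a * b for a, b in zip(u, v)], probs)

def project(x, basis, probs):
    """Projection of x onto span(basis); the basis is assumed linearly independent."""
    n = len(basis)
    # Normal equations: sum_k c_k E[Z_i Z_k] = E[X Z_i] for each i (augmented matrix).
    rows = [[inner(basis[i], basis[k], probs) for k in range(n)]
            + [inner(x, basis[i], probs)] for i in range(n)]
    for col in range(n):  # Gauss-Jordan elimination with exact arithmetic
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    coeffs = [rows[i][n] for i in range(n)]
    return [sum(c * z[w] for c, z in zip(coeffs, basis)) for w in range(len(x))]

# Hypothetical example: four equally likely outcomes, V = span{1, Y}.
probs = [Fraction(1, 4)] * 4
one = [1, 1, 1, 1]
y = [0, 1, 2, 3]
x = [1, 0, 4, 9]

z_star = project(x, [one, y], probs)
err = [a - b for a, b in zip(x, z_star)]

assert inner(err, one, probs) == 0 and inner(err, y, probs) == 0    # part (b)(ii)
assert inner(err, err, probs) == inner(x, x, probs) - inner(z_star, z_star, probs)  # part (c)
```

Checking orthogonality against the two spanning variables suffices, since orthogonality to every element of the span then follows by linearity.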

The following propositions give some properties of the projection mapping $\Pi_{\mathcal{V}}$, with proofs
based on the orthogonality principle.

Proposition 3.2.2 (Linearity of projection) Suppose $\mathcal{V}$ is a closed linear subspace of $L^2(\Omega, \mathcal{F}, P)$,
$X_1$ and $X_2$ are in $L^2(\Omega, \mathcal{F}, P)$, and $a_1$ and $a_2$ are constants. Then

$$\Pi_{\mathcal{V}}(a_1 X_1 + a_2 X_2) = a_1 \Pi_{\mathcal{V}}(X_1) + a_2 \Pi_{\mathcal{V}}(X_2). \tag{3.2}$$

Proof. By the characterization part of the orthogonality principle (part (b) of Theorem 3.2.1),
the projection $\Pi_{\mathcal{V}}(a_1 X_1 + a_2 X_2)$ is characterized by two properties. So, to prove (3.2), it suffices
to show that $a_1 \Pi_{\mathcal{V}}(X_1) + a_2 \Pi_{\mathcal{V}}(X_2)$ satisfies these two properties. First, we must check that
$a_1 \Pi_{\mathcal{V}}(X_1) + a_2 \Pi_{\mathcal{V}}(X_2) \in \mathcal{V}$. This follows immediately from the fact that $\Pi_{\mathcal{V}}(X_i) \in \mathcal{V}$, for $i = 1, 2$,
and $\mathcal{V}$ is a linear subspace, so the first property is checked. Second, we must check that $e \perp Z$, where
$e = a_1 X_1 + a_2 X_2 - (a_1 \Pi_{\mathcal{V}}(X_1) + a_2 \Pi_{\mathcal{V}}(X_2))$, and Z is an arbitrary element of $\mathcal{V}$. Now $e = a_1 e_1 + a_2 e_2$,
where $e_i = X_i - \Pi_{\mathcal{V}}(X_i)$ for $i = 1, 2$, and $e_i \perp Z$ for $i = 1, 2$. So $E[eZ] = a_1 E[e_1 Z] + a_2 E[e_2 Z] = 0$,
or equivalently, $e \perp Z$. Thus, the second property is also checked, and the proof is complete. ■

Proposition 3.2.3 (Projections onto nested subspaces) Suppose $\mathcal{V}_1$ and $\mathcal{V}_2$ are closed linear
subspaces of $L^2(\Omega, \mathcal{F}, P)$ such that $\mathcal{V}_2 \subset \mathcal{V}_1$. Then for any $X \in L^2(\Omega, \mathcal{F}, P)$, $\Pi_{\mathcal{V}_2}(X) = \Pi_{\mathcal{V}_2} \Pi_{\mathcal{V}_1}(X)$. (In
words, the projection of X onto $\mathcal{V}_2$ can be found by first projecting X onto $\mathcal{V}_1$, and then projecting
the result onto $\mathcal{V}_2$.) Furthermore,

$$E[(X - \Pi_{\mathcal{V}_2}(X))^2] = E[(X - \Pi_{\mathcal{V}_1}(X))^2] + E[(\Pi_{\mathcal{V}_1}(X) - \Pi_{\mathcal{V}_2}(X))^2]. \tag{3.3}$$

In particular, $E[(X - \Pi_{\mathcal{V}_2}(X))^2] \ge E[(X - \Pi_{\mathcal{V}_1}(X))^2]$.

Proof. By the characterization part of the orthogonality principle (part (b) of Theorem 3.2.1), the
projection $\Pi_{\mathcal{V}_2}(X)$ is characterized by two properties. So, to prove $\Pi_{\mathcal{V}_2}(X) = \Pi_{\mathcal{V}_2} \Pi_{\mathcal{V}_1}(X)$, it suffices
to show that $\Pi_{\mathcal{V}_2} \Pi_{\mathcal{V}_1}(X)$ satisfies the two properties. First, we must check that $\Pi_{\mathcal{V}_2} \Pi_{\mathcal{V}_1}(X) \in \mathcal{V}_2$.
This follows immediately from the fact that $\Pi_{\mathcal{V}_2}$ maps into $\mathcal{V}_2$, so the first property is checked.
Second, we must check that $e \perp Z$, where $e = X - \Pi_{\mathcal{V}_2} \Pi_{\mathcal{V}_1}(X)$, and Z is an arbitrary element of $\mathcal{V}_2$.
Now $e = e_1 + e_2$, where $e_1 = X - \Pi_{\mathcal{V}_1}(X)$ and $e_2 = \Pi_{\mathcal{V}_1}(X) - \Pi_{\mathcal{V}_2} \Pi_{\mathcal{V}_1}(X)$. By the characterization
of $\Pi_{\mathcal{V}_1}(X)$, $e_1$ is perpendicular to any random variable in $\mathcal{V}_1$. In particular, $e_1 \perp Z$, because
$Z \in \mathcal{V}_2 \subset \mathcal{V}_1$. The characterization of the projection of $\Pi_{\mathcal{V}_1}(X)$ onto $\mathcal{V}_2$ implies that $e_2 \perp Z$. Since
$e_i \perp Z$ for $i = 1, 2$, it follows that $e \perp Z$. Thus, the second property is also checked, so it is proved
that $\Pi_{\mathcal{V}_2}(X) = \Pi_{\mathcal{V}_2} \Pi_{\mathcal{V}_1}(X)$.

As mentioned above, $e_1$ is perpendicular to any random variable in $\mathcal{V}_1$, which implies that
$e_1 \perp e_2$. Thus, $E[e^2] = E[e_1^2] + E[e_2^2]$, which is equivalent to (3.3). Therefore, (3.3) is proved. The
last inequality of the proposition follows, of course, from (3.3). The inequality is also equivalent to
the inequality $\min_{W \in \mathcal{V}_2} E[(X - W)^2] \ge \min_{W \in \mathcal{V}_1} E[(X - W)^2]$, and this inequality is true because
the minimum of a set of numbers cannot increase if more numbers are added to the set. ■

The following proposition is closely related to the use of linear innovations sequences, discussed 
in Sections 3.5 and 3.6. 

Proposition 3.2.4 (Projection onto the span of orthogonal subspaces) Suppose $\mathcal{V}_1$ and $\mathcal{V}_2$ are
closed linear subspaces of $L^2(\Omega, \mathcal{F}, P)$ such that $\mathcal{V}_1 \perp \mathcal{V}_2$, which means that $E[Z_1 Z_2] = 0$ for any
$Z_1 \in \mathcal{V}_1$ and $Z_2 \in \mathcal{V}_2$. Let $\mathcal{V} = \mathcal{V}_1 \oplus \mathcal{V}_2 = \{Z_1 + Z_2 : Z_i \in \mathcal{V}_i\}$ denote the span of $\mathcal{V}_1$ and $\mathcal{V}_2$. Then
for any $X \in L^2(\Omega, \mathcal{F}, P)$, $\Pi_{\mathcal{V}}(X) = \Pi_{\mathcal{V}_1}(X) + \Pi_{\mathcal{V}_2}(X)$. The minimum mean square error satisfies

$$E[(X - \Pi_{\mathcal{V}}(X))^2] = E[X^2] - E[(\Pi_{\mathcal{V}_1}(X))^2] - E[(\Pi_{\mathcal{V}_2}(X))^2].$$

Proof. The space $\mathcal{V}$ is also a closed linear subspace of $L^2(\Omega, \mathcal{F}, P)$ (see a starred homework
problem). By the characterization part of the orthogonality principle (part (b) of Theorem 3.2.1),
the projection $\Pi_{\mathcal{V}}(X)$ is characterized by two properties. So to prove $\Pi_{\mathcal{V}}(X) = \Pi_{\mathcal{V}_1}(X) + \Pi_{\mathcal{V}_2}(X)$,
it suffices to show that $\Pi_{\mathcal{V}_1}(X) + \Pi_{\mathcal{V}_2}(X)$ satisfies these two properties. First, we must check that
$\Pi_{\mathcal{V}_1}(X) + \Pi_{\mathcal{V}_2}(X) \in \mathcal{V}$. This follows immediately from the fact that $\Pi_{\mathcal{V}_i}(X) \in \mathcal{V}_i$, for $i = 1, 2$, so the
first property is checked. Second, we must check that $e \perp Z$, where $e = X - (\Pi_{\mathcal{V}_1}(X) + \Pi_{\mathcal{V}_2}(X))$,
and Z is an arbitrary element of $\mathcal{V}$. Now any such Z can be written as $Z = Z_1 + Z_2$ where $Z_i \in \mathcal{V}_i$
for $i = 1, 2$. Observe that $\Pi_{\mathcal{V}_2}(X) \perp Z_1$ because $\Pi_{\mathcal{V}_2}(X) \in \mathcal{V}_2$ and $Z_1 \in \mathcal{V}_1$. Therefore,

$$\begin{aligned}
E[e Z_1] &= E[(X - (\Pi_{\mathcal{V}_1}(X) + \Pi_{\mathcal{V}_2}(X))) Z_1] \\
&= E[(X - \Pi_{\mathcal{V}_1}(X)) Z_1] = 0,
\end{aligned}$$

where the last equality follows from the characterization of $\Pi_{\mathcal{V}_1}(X)$. Thus, $e \perp Z_1$, and similarly
$e \perp Z_2$, so $e \perp Z$. Thus, the second property is also checked, so $\Pi_{\mathcal{V}}(X) = \Pi_{\mathcal{V}_1}(X) + \Pi_{\mathcal{V}_2}(X)$ is
proved.

Since $\Pi_{\mathcal{V}_i}(X) \in \mathcal{V}_i$ for $i = 1, 2$, $\Pi_{\mathcal{V}_1}(X) \perp \Pi_{\mathcal{V}_2}(X)$. Therefore, $E[(\Pi_{\mathcal{V}}(X))^2] = E[(\Pi_{\mathcal{V}_1}(X))^2] +
E[(\Pi_{\mathcal{V}_2}(X))^2]$, and the expression for the MMSE in the proposition follows from the error expression
in the orthogonality principle. ■



3.3 Conditional expectation and linear estimators 

In many applications, a random variable X is to be estimated based on observation of a random 
variable Y. Thus, an estimator is a function of Y. In applications, the two most frequently 
considered classes of functions of Y used in this context are essentially all functions, leading to the 
best unconstrained estimator, or all linear functions, leading to the best linear estimator. These 
two possibilities are discussed in this section. 

3.3.1 Conditional expectation as a projection 

Suppose a random variable X is to be estimated using an observed random vector Y of dimension
m. Suppose $E[X^2] < +\infty$. Consider the most general class of estimators based on Y, by setting

$$\mathcal{V} = \{ g(Y) : g : \mathbb{R}^m \to \mathbb{R},\ E[g(Y)^2] < +\infty \}. \tag{3.4}$$

There is also the implicit condition that g is Borel measurable so that g(Y) is a random variable. 
The projection of X onto this class V is the unconstrained minimum mean square error (MMSE) 
estimator of X given Y. 

Let us first proceed to identify the optimal estimator by conditioning on the value of Y, thereby
reducing this example to the estimation of a random variable by a constant, as discussed at the
beginning of Section 3.2. For technical reasons we assume for now that X and Y have a joint pdf.
Then, conditioning on Y,

$$E[(X - g(Y))^2] = \int_{\mathbb{R}^m} E[(X - g(Y))^2 \mid Y = y]\, f_Y(y)\, dy$$

where

$$E[(X - g(Y))^2 \mid Y = y] = \int_{-\infty}^{\infty} (x - g(y))^2 f_{X|Y}(x \mid y)\, dx .$$

Since the mean is the MMSE estimator of a random variable among all constants, for each fixed y,
the minimizing choice for $g(y)$ is

$$g^*(y) = E[X \mid Y = y] = \int_{-\infty}^{\infty} x f_{X|Y}(x \mid y)\, dx. \tag{3.5}$$

Therefore, the optimal estimator in $\mathcal{V}$ is $g^*(Y)$ which, by definition, is equal to the random variable
$E[X \mid Y]$.

What does the orthogonality principle imply for this example? It implies that there exists an
optimal estimator $g^*(Y)$ which is the unique element of $\mathcal{V}$ such that

$$(X - g^*(Y)) \perp g(Y)$$

for all $g(Y) \in \mathcal{V}$. If X, Y have a joint pdf then we can check that $E[X \mid Y]$ satisfies the required
condition. Indeed,

$$\begin{aligned}
E[(X - E[X \mid Y]) g(Y)] &= \int \int (x - E[X \mid Y = y])\, g(y)\, f_{X|Y}(x \mid y)\, f_Y(y)\, dx\, dy \\
&= \int \left\{ \int (x - E[X \mid Y = y])\, f_{X|Y}(x \mid y)\, dx \right\} g(y)\, f_Y(y)\, dy \\
&= 0,
\end{aligned}$$

because the expression within the braces is zero.

In summary, if X and Y have a joint pdf (and similarly if they have a joint pmf) then the
MMSE estimator of X given Y is $E[X \mid Y]$. Even if X and Y don't have a joint pdf or joint pmf,
we define the conditional expectation $E[X \mid Y]$ to be the MMSE estimator of X given Y. By the
orthogonality principle $E[X \mid Y]$ exists as long as $E[X^2] < \infty$, and it is the unique function of Y
such that

$$E[(X - E[X \mid Y]) g(Y)] = 0$$

for all $g(Y)$ in $\mathcal{V}$.
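
For a joint pmf the same characterization can be verified directly. The sketch below computes the conditional mean from a made-up joint pmf and checks that the error is orthogonal to several functions of Y. The pmf, the test functions, and the helper names are assumptions chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical joint pmf of (X, Y): (x, y) -> probability (illustrative numbers only).
pmf = {(0, 0): Fraction(1, 4), (1, 0): Fraction(1, 4),
       (1, 1): Fraction(1, 8), (3, 1): Fraction(3, 8)}

def expect(h, pmf):
    """E[h(X, Y)] for a finite joint pmf."""
    return sum(p * h(x, y) for (x, y), p in pmf.items())

def conditional_mean(pmf):
    """Return the map y -> E[X | Y = y], the pmf analogue of (3.5)."""
    mass, weighted = {}, {}
    for (x, y), p in pmf.items():
        mass[y] = mass.get(y, 0) + p
        weighted[y] = weighted.get(y, 0) + x * p
    return {y: weighted[y] / mass[y] for y in mass}

g_star = conditional_mean(pmf)

# Orthogonality: E[(X - E[X|Y]) g(Y)] = 0 for every function g tried here.
test_functions = [lambda y: 1, lambda y: y, lambda y: (y - 3) ** 2, lambda y: (-1) ** y]
inner_products = [expect(lambda x, y: (x - g_star[y]) * g(y), pmf) for g in test_functions]
assert all(v == 0 for v in inner_products)

# Any other function of Y, for example 2Y + 1, has a larger mean square error.
mse_star = expect(lambda x, y: (x - g_star[y]) ** 2, pmf)
mse_other = expect(lambda x, y: (x - (2 * y + 1)) ** 2, pmf)
assert mse_star < mse_other
```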

Estimation of a random variable has been discussed, but often we wish to estimate a random
vector. A beauty of the MSE criterion is that it easily extends to estimation of random vectors,
because the MSE for estimation of a random vector is the sum of the MSEs of the coordinates:

$$E[\|X - g(Y)\|^2] = \sum_{i=1}^{m} E[(X_i - g_i(Y))^2].$$

Therefore, for most sets of estimators $\mathcal{V}$ typically encountered, finding the MMSE estimator of a
random vector X decomposes into finding the MMSE estimators of the coordinates of X separately.

Suppose a random vector X is to be estimated using estimators of the form $g(Y)$, where here g
maps $\mathbb{R}^n$ into $\mathbb{R}^m$. Assume $E[\|X\|^2] < +\infty$ and seek an estimator to minimize the MSE. As seen
above, the MMSE estimator for each coordinate $X_i$ is $E[X_i \mid Y]$, which is also the projection of $X_i$
onto the set of unconstrained estimators based on Y, defined in (3.4). So the optimal estimator
$g^*(Y)$ of the entire vector X is given by

$$g^*(Y) = E[X \mid Y] = \begin{pmatrix} E[X_1 \mid Y] \\ E[X_2 \mid Y] \\ \vdots \\ E[X_m \mid Y] \end{pmatrix}.$$

Let the estimation error be denoted by e, $e = X - E[X \mid Y]$. (Even though e is a random vector
we use lower case for it for an obvious reason.)

The mean of the error is given by $Ee = 0$. As for the covariance of the error, note that
$E[X_j \mid Y]$ is in $\mathcal{V}$ for each j, so $e_i \perp E[X_j \mid Y]$ for each i, j. Since $Ee_i = 0$, it follows that
$\mathrm{Cov}(e_i, E[X_j \mid Y]) = 0$ for all i, j. Equivalently, $\mathrm{Cov}(e, E[X \mid Y]) = 0$. Using this and the fact
$X = E[X \mid Y] + e$ yields

$$\begin{aligned}
\mathrm{Cov}(X) &= \mathrm{Cov}(E[X \mid Y] + e) \\
&= \mathrm{Cov}(E[X \mid Y]) + \mathrm{Cov}(e) + \mathrm{Cov}(E[X \mid Y], e) + \mathrm{Cov}(e, E[X \mid Y]) \\
&= \mathrm{Cov}(E[X \mid Y]) + \mathrm{Cov}(e).
\end{aligned}$$

Thus, $\mathrm{Cov}(e) = \mathrm{Cov}(X) - \mathrm{Cov}(E[X \mid Y])$.

In practice, computation of E[X \ Y] (for example, using (3.5) in case a joint pdf exists) may 
be too complex or may require more information about the joint distribution of X and Y than 
is available. For both of these reasons, it is worthwhile to consider classes of estimators that are 
constrained to smaller sets of functions of the observations. A widely used set is the set of all linear 
functions, leading to linear estimators, described next. 

3.3.2 Linear estimators 

Let X and Y be random vectors with $E[\|X\|^2] < +\infty$ and $E[\|Y\|^2] < +\infty$. Seek estimators of
the form $AY + b$ to minimize the MSE. Such estimators are called linear estimators because each
coordinate of $AY + b$ is a linear combination of $Y_1, Y_2, \ldots, Y_m$ and 1. Here "1" stands for the
random variable that is always equal to 1.

To identify the optimal linear estimator we shall apply the orthogonality principle for each
coordinate of X with

$$\mathcal{V} = \{ c_0 + c_1 Y_1 + c_2 Y_2 + \cdots + c_m Y_m : c_0, c_1, \ldots, c_m \in \mathbb{R} \}.$$

Let e denote the estimation error $e = X - (AY + b)$. We must select A and b so that $e_i \perp Z$ for all
$Z \in \mathcal{V}$ and all i. Equivalently, we must select A and b so that

$$e_i \perp 1 \ \text{ for all } i, \qquad e_i \perp Y_j \ \text{ for all } i, j.$$

The condition $e_i \perp 1$, which means $Ee_i = 0$, implies that $E[e_i Y_j] = \mathrm{Cov}(e_i, Y_j)$. Thus, the
required orthogonality conditions on A and b become $Ee = 0$ and $\mathrm{Cov}(e, Y) = 0$. The condition
$Ee = 0$ requires that $b = EX - A\,EY$, so we can restrict our attention to estimators of the
form $EX + A(Y - EY)$, so that $e = X - EX - A(Y - EY)$. The condition $\mathrm{Cov}(e, Y) = 0$
becomes $\mathrm{Cov}(X, Y) - A\,\mathrm{Cov}(Y, Y) = 0$. If $\mathrm{Cov}(Y, Y)$ is not singular, then A must be given by
$A = \mathrm{Cov}(X, Y)\,\mathrm{Cov}(Y, Y)^{-1}$. In this case the optimal linear estimator, denoted by $\hat{E}[X \mid Y]$, is
given by

$$\hat{E}[X \mid Y] = E[X] + \mathrm{Cov}(X, Y)\,\mathrm{Cov}(Y, Y)^{-1}(Y - EY). \tag{3.6}$$

Proceeding as in the case of unconstrained estimators of a random vector, we find that the covariance
of the error vector satisfies

$$\mathrm{Cov}(e) = \mathrm{Cov}(X) - \mathrm{Cov}(\hat{E}[X \mid Y])$$

which by (3.6) yields

$$\mathrm{Cov}(e) = \mathrm{Cov}(X) - \mathrm{Cov}(X, Y)\,\mathrm{Cov}(Y, Y)^{-1}\,\mathrm{Cov}(Y, X). \tag{3.7}$$
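
In the scalar case, (3.6) and (3.7) reduce to a slope, an intercept, and a single error variance, all computable from first and second moments. The following sketch works this out for a made-up joint pmf in which X = Y^2, so the best linear estimator is visibly different from the exact relationship. The pmf and the function name `linear_mmse` are illustrative assumptions.

```python
from fractions import Fraction

# Hypothetical joint pmf of (X, Y) with X = Y^2 on three equally likely points.
pmf = {(0, 0): Fraction(1, 3), (1, 1): Fraction(1, 3), (4, 2): Fraction(1, 3)}

def expect(h, pmf):
    """E[h(X, Y)] for a finite joint pmf."""
    return sum(p * h(x, y) for (x, y), p in pmf.items())

def linear_mmse(pmf):
    """Scalar case of (3.6)-(3.7): returns (a, b, mse) for the estimator a*Y + b."""
    ex = expect(lambda x, y: x, pmf)
    ey = expect(lambda x, y: y, pmf)
    var_x = expect(lambda x, y: (x - ex) ** 2, pmf)
    var_y = expect(lambda x, y: (y - ey) ** 2, pmf)
    cov_xy = expect(lambda x, y: (x - ex) * (y - ey), pmf)
    a = cov_xy / var_y                 # A = Cov(X,Y) Cov(Y,Y)^{-1}; requires Var(Y) > 0
    b = ex - a * ey                    # b = EX - A EY
    mse = var_x - cov_xy ** 2 / var_y  # (3.7) in the scalar case
    return a, b, mse

a, b, mse = linear_mmse(pmf)

# The error is orthogonal to 1 and to Y, and its second moment matches (3.7).
assert expect(lambda x, y: x - (a * y + b), pmf) == 0
assert expect(lambda x, y: (x - (a * y + b)) * y, pmf) == 0
assert expect(lambda x, y: (x - (a * y + b)) ** 2, pmf) == mse
```

Here the unconstrained estimator would recover X exactly, while the linear one leaves a positive error; that gap is the price of restricting to linear functions.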

3.3.3 Discussion of the estimators 

As seen above, the expectation $E[X]$, the MMSE linear estimator $\hat{E}[X \mid Y]$, and the conditional
expectation $E[X \mid Y]$, are all instances of projection mappings $\Pi_{\mathcal{V}}$, for $\mathcal{V}$ consisting of constants,
linear estimators based on Y, or unconstrained estimators based on Y, respectively. Hence, the
orthogonality principle, and Propositions 3.2.2-3.2.4 all apply to these estimators.

Proposition 3.2.2 implies that these estimators are linear functions of X. In particular,
$E[a_1 X_1 + a_2 X_2 \mid Y] = a_1 E[X_1 \mid Y] + a_2 E[X_2 \mid Y]$, and the same is true with "$E$" replaced by "$\hat{E}$."

Proposition 3.2.3, regarding projections onto nested subspaces, implies an ordering of the mean
square errors:

$$E[(X - E[X \mid Y])^2] \le E[(X - \hat{E}[X \mid Y])^2] \le \mathrm{Var}(X).$$

Furthermore, it implies that the best linear estimator of X based on Y is equal to the best linear
estimator of the estimator $E[X \mid Y]$: that is, $\hat{E}[X \mid Y] = \hat{E}[E[X \mid Y] \mid Y]$. It follows, in particular, that
$E[X \mid Y] = \hat{E}[X \mid Y]$ if and only if $E[X \mid Y]$ has the linear form, $AY + b$. Similarly, $E[X]$, the best
constant estimator of X, is also the best constant estimator of $\hat{E}[X \mid Y]$ or of $E[X \mid Y]$. That is,
$E[X] = E[\hat{E}[X \mid Y]] = E[E[X \mid Y]]$. In fact, $E[X] = E[\hat{E}[E[X \mid Y] \mid Y]]$.

Proposition 3.2.3 also implies relations among estimators based on different sets of observations.
For example, suppose X is to be estimated and $Y_1$ and $Y_2$ are both possible observations. The space
of unrestricted estimators based on $Y_1$ alone is a subspace of the space of unrestricted estimators
based on both $Y_1$ and $Y_2$. Therefore, Proposition 3.2.3 implies that $E[E[X \mid Y_1, Y_2] \mid Y_1] = E[X \mid Y_1]$, a
property that is sometimes called the tower property of conditional expectation. The same relation
holds true for the same reason for the best linear estimators: $\hat{E}[\hat{E}[X \mid Y_1, Y_2] \mid Y_1] = \hat{E}[X \mid Y_1]$.
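
The tower property can be checked exactly on a finite sample space. In the sketch below, outcomes are triples (x, y1, y2) with made-up probabilities, and the helper `cond_mean` (a name introduced here for illustration) averages a function of the outcome over the level sets of whatever is being conditioned on.

```python
from collections import defaultdict
from fractions import Fraction

def cond_mean(pmf, value, key):
    """Map each value k of key(outcome) to E[value(outcome) | key(outcome) = k]."""
    mass, weighted = defaultdict(Fraction), defaultdict(Fraction)
    for w, p in pmf.items():
        mass[key(w)] += p
        weighted[key(w)] += value(w) * p
    return {k: weighted[k] / mass[k] for k in mass}

# Hypothetical pmf on outcomes (x, y1, y2); the numbers are illustrative only.
pmf = {(0, 0, 0): Fraction(1, 8), (2, 0, 1): Fraction(1, 8), (1, 0, 1): Fraction(1, 4),
       (3, 1, 0): Fraction(1, 4), (5, 1, 1): Fraction(1, 4)}

# E[X | Y1, Y2] as a function of the pair (y1, y2).
fine = cond_mean(pmf, lambda w: w[0], lambda w: (w[1], w[2]))
# E[ E[X | Y1, Y2] | Y1 ]: project the finer estimator onto functions of Y1 alone.
lhs = cond_mean(pmf, lambda w: fine[(w[1], w[2])], lambda w: w[1])
# E[X | Y1] computed directly.
rhs = cond_mean(pmf, lambda w: w[0], lambda w: w[1])

assert lhs == rhs  # tower property
```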

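Example 3.3.1 Let X, Y be jointly continuous random variables with the pdf

$$f_{XY}(x, y) = \begin{cases} x + y & 0 \le x, y \le 1 \\ 0 & \text{else.} \end{cases}$$

Let us find $E[X \mid Y]$ and $\hat{E}[X \mid Y]$. To find $E[X \mid Y]$ we first identify $f_Y(y)$ and $f_{X|Y}(x \mid y)$:

$$f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x, y)\, dx = \begin{cases} \frac{1}{2} + y & 0 \le y \le 1 \\ 0 & \text{else.} \end{cases}$$

Therefore, $f_{X|Y}(x \mid y)$ is defined only for $0 \le y \le 1$, and for such y it is given by

$$f_{X|Y}(x \mid y) = \begin{cases} \dfrac{x + y}{\frac{1}{2} + y} & 0 \le x \le 1 \\ 0 & \text{else.} \end{cases}$$

So for $0 \le y \le 1$,

$$E[X \mid Y = y] = \int_0^1 x f_{X|Y}(x \mid y)\, dx = \frac{2 + 3y}{3 + 6y}.$$

Therefore, $E[X \mid Y] = \frac{2 + 3Y}{3 + 6Y}$. To find $\hat{E}[X \mid Y]$ we compute $EX = EY = \frac{7}{12}$, $\mathrm{Var}(Y) = \frac{11}{144}$ and
$\mathrm{Cov}(X, Y) = -\frac{1}{144}$, so $\hat{E}[X \mid Y] = \frac{7}{12} - \frac{1}{11}\left(Y - \frac{7}{12}\right)$.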
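
The closed forms in Example 3.3.1 can be cross-checked numerically. The sketch below uses a plain midpoint rule, so the results agree with the exact values only up to discretization error. The helper names and the grid sizes are arbitrary choices made for illustration.

```python
def integrate(f, a, b, n):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

def f_xy(x, y):
    """Joint pdf of Example 3.3.1 on the unit square."""
    return x + y

def cond_mean(y, n=2000):
    """Numerical E[X | Y = y]: integral of x f(x, y) dx divided by f_Y(y)."""
    f_y = integrate(lambda x: f_xy(x, y), 0.0, 1.0, n)
    return integrate(lambda x: x * f_xy(x, y), 0.0, 1.0, n) / f_y

def moment(h, n=300):
    """Numerical E[h(X, Y)] by iterated midpoint integration over the unit square."""
    return integrate(
        lambda y: integrate(lambda x: h(x, y) * f_xy(x, y), 0.0, 1.0, n), 0.0, 1.0, n)

ex = moment(lambda x, y: x)
ey = moment(lambda x, y: y)
var_y = moment(lambda x, y: y * y) - ey ** 2
cov_xy = moment(lambda x, y: x * y) - ex * ey
slope = cov_xy / var_y   # coefficient of (Y - EY) in the linear estimator

# Compare with the closed forms: (2 + 3y)/(3 + 6y), 7/12, 11/144, -1/144, -1/11.
assert abs(cond_mean(0.3) - (2 + 3 * 0.3) / (3 + 6 * 0.3)) < 1e-6
assert abs(ex - 7 / 12) < 1e-5 and abs(var_y - 11 / 144) < 1e-5
assert abs(cov_xy + 1 / 144) < 1e-5 and abs(slope + 1 / 11) < 1e-4
```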



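Example 3.3.2 Suppose that $Y = XU$, where X and U are independent random variables, X has
the Rayleigh density

$$f_X(x) = \begin{cases} \dfrac{x}{\sigma^2} e^{-x^2/2\sigma^2} & x \ge 0 \\ 0 & \text{else} \end{cases}$$

and U is uniformly distributed on the interval [0, 1]. We find $\hat{E}[X \mid Y]$ and $E[X \mid Y]$. To compute
$\hat{E}[X \mid Y]$ we find

$$EX = \int_0^{\infty} \frac{x^2}{\sigma^2} e^{-x^2/2\sigma^2}\, dx = \frac{1}{\sigma} \sqrt{\frac{\pi}{2}} \int_{-\infty}^{\infty} \frac{x^2}{\sqrt{2\pi\sigma^2}} e^{-x^2/2\sigma^2}\, dx = \sigma \sqrt{\frac{\pi}{2}}$$

$$EY = EX\,EU = \frac{\sigma}{2} \sqrt{\frac{\pi}{2}}$$

$$E[X^2] = 2\sigma^2$$

$$\mathrm{Var}(Y) = E[Y^2] - E[Y]^2 = E[X^2]E[U^2] - E[X]^2 E[U]^2 = \sigma^2 \left( \frac{2}{3} - \frac{\pi}{8} \right)$$

$$\mathrm{Cov}(X, Y) = E[U]E[X^2] - E[U]E[X]^2 = \frac{1}{2} \mathrm{Var}(X) = \sigma^2 \left( 1 - \frac{\pi}{4} \right)$$

Thus

$$\hat{E}[X \mid Y] = \sigma \sqrt{\frac{\pi}{2}} + \frac{\left(1 - \frac{\pi}{4}\right)}{\left(\frac{2}{3} - \frac{\pi}{8}\right)} \left( Y - \frac{\sigma}{2} \sqrt{\frac{\pi}{2}} \right).$$

To find $E[X \mid Y]$ we first find the joint density and then the conditional density. Now

$$f_{XY}(x, y) = f_X(x) f_{Y|X}(y \mid x) = \begin{cases} \dfrac{1}{\sigma^2} e^{-x^2/2\sigma^2} & 0 \le y \le x \\ 0 & \text{else} \end{cases}$$

$$f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x, y)\, dx = \begin{cases} \displaystyle\int_y^{\infty} \frac{1}{\sigma^2} e^{-x^2/2\sigma^2}\, dx = \frac{\sqrt{2\pi}}{\sigma} Q\!\left(\frac{y}{\sigma}\right) & y \ge 0 \\ 0 & y < 0 \end{cases}$$

where Q is the complementary CDF for the standard normal distribution. So for $y \ge 0$

$$E[X \mid Y = y] = \frac{\int_{-\infty}^{\infty} x f_{XY}(x, y)\, dx}{f_Y(y)} = \frac{\int_y^{\infty} \frac{x}{\sigma^2} e^{-x^2/2\sigma^2}\, dx}{\frac{\sqrt{2\pi}}{\sigma} Q\!\left(\frac{y}{\sigma}\right)} = \frac{\sigma \exp(-y^2/2\sigma^2)}{\sqrt{2\pi}\, Q\!\left(\frac{y}{\sigma}\right)}.$$

Thus,

$$E[X \mid Y] = \frac{\sigma \exp(-Y^2/2\sigma^2)}{\sqrt{2\pi}\, Q\!\left(\frac{Y}{\sigma}\right)}.$$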
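
As a quick sanity check on Example 3.3.2 (an added verification, not part of the original derivation), the conditional mean should average back to $EX$, as noted in Section 3.3.3. Multiplying the expression for $E[X \mid Y = y]$ by $f_Y(y)$ cancels the Q function, which makes the integral elementary:

```latex
E\big[E[X \mid Y]\big]
  = \int_0^{\infty} \frac{\sigma \exp(-y^2/2\sigma^2)}{\sqrt{2\pi}\, Q(y/\sigma)}
    \cdot \frac{\sqrt{2\pi}}{\sigma}\, Q\!\left(\frac{y}{\sigma}\right) dy
  = \int_0^{\infty} e^{-y^2/2\sigma^2}\, dy
  = \frac{1}{2}\sqrt{2\pi\sigma^2}
  = \sigma\sqrt{\frac{\pi}{2}}
  = E[X].
```

The same value is obtained by taking the expectation of the linear estimator $\hat{E}[X \mid Y]$, since its second term has mean zero.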



Example 3.3.3 Suppose that Y is a random variable and f is a Borel measurable function such
that $E[f(Y)^2] < \infty$. Let us show that $E[f(Y) \mid Y] = f(Y)$. By definition, $E[f(Y) \mid Y]$ is the random
variable of the form $g(Y)$ which is closest to $f(Y)$ in the mean square sense. If we take $g(Y) = f(Y)$,
then the mean square error is zero. No other estimator can have a smaller mean square error. Thus,
$E[f(Y) \mid Y] = f(Y)$. Similarly, if Y is a random vector with $E[\|Y\|^2] < \infty$, and if A is a matrix and
b a vector, then $\hat{E}[AY + b \mid Y] = AY + b$.



3.4 Joint Gaussian distribution and Gaussian random vectors 

Recall that a random variable X is Gaussian (or normal) with mean $\mu$ and variance $\sigma^2 > 0$ if X
has pdf

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).$$

As a degenerate case, we say X is Gaussian with mean $\mu$ and variance 0 if $P\{X = \mu\} = 1$.
Equivalently, X is Gaussian with mean $\mu$ and variance $\sigma^2 \ge 0$ if its characteristic function is given
by

$$\Phi_X(u) = \exp\left( -\frac{u^2 \sigma^2}{2} + j\mu u \right).$$

Lemma 3.4.1 Suppose $X_1, X_2, \ldots, X_n$ are independent Gaussian random variables. Then any
linear combination $a_1 X_1 + \cdots + a_n X_n$ is a Gaussian random variable.

Proof. By an induction argument on n, it is sufficient to prove the lemma for $n = 2$. Also, if X
is a Gaussian random variable, then so is $aX$ for any constant a, so we can assume without loss
of generality that $a_1 = a_2 = 1$. It remains to prove that if $X_1$ and $X_2$ are independent Gaussian
random variables, then the sum $X = X_1 + X_2$ is also a Gaussian random variable. Let $\mu_i = E[X_i]$
and $\sigma_i^2 = \mathrm{Var}(X_i)$. Then the characteristic function of X is given by

$$\begin{aligned}
\Phi_X(u) &= E[e^{juX}] = E[e^{juX_1} e^{juX_2}] = E[e^{juX_1}]\, E[e^{juX_2}] \\
&= \exp\left( -\frac{u^2 \sigma_1^2}{2} + j\mu_1 u \right) \exp\left( -\frac{u^2 \sigma_2^2}{2} + j\mu_2 u \right) = \exp\left( -\frac{u^2 \sigma^2}{2} + j\mu u \right)
\end{aligned}$$

where $\mu = \mu_1 + \mu_2$ and $\sigma^2 = \sigma_1^2 + \sigma_2^2$. Thus, X is a $N(\mu, \sigma^2)$ random variable. ■

Let $(X_i : i \in I)$ be a collection of random variables indexed by some set $I$, which possibly has
infinite cardinality. A finite linear combination of $(X_i : i \in I)$ is a random variable of the form

$$a_1 X_{i_1} + a_2 X_{i_2} + \cdots + a_n X_{i_n}$$

where n is finite, $i_k \in I$ for each k, and $a_k \in \mathbb{R}$ for each k.

Definition 3.4.2 A collection $(X_i : i \in I)$ of random variables has a joint Gaussian distribution
(and the random variables $X_i : i \in I$ themselves are said to be jointly Gaussian) if every finite
linear combination of $(X_i : i \in I)$ is a Gaussian random variable. A random vector X is called
a Gaussian random vector if its coordinate random variables are jointly Gaussian. A collection of
random vectors is said to have a joint Gaussian distribution if all of the coordinate random variables
of all of the vectors are jointly Gaussian.

We write that X is a $N(\mu, K)$ random vector if X is a Gaussian random vector with mean vector
$\mu$ and covariance matrix K.

Proposition 3.4.3 (a) If $(X_i : i \in I)$ has a joint Gaussian distribution, then each of the random
variables itself is Gaussian.

(b) If the random variables $X_i : i \in I$ are each Gaussian and if they are independent, which
means that $X_{i_1}, X_{i_2}, \ldots, X_{i_n}$ are independent for any finite number of indices $i_1, i_2, \ldots, i_n$,
then $(X_i : i \in I)$ has a joint Gaussian distribution.

(c) (Preservation of joint Gaussian property under linear combinations and limits) Suppose
$(X_i : i \in I)$ has a joint Gaussian distribution. Let $(Y_j : j \in J)$ denote a collection of random
variables such that each $Y_j$ is a finite linear combination of $(X_i : i \in I)$, and let $(Z_k : k \in \mathcal{K})$
denote a set of random variables such that each $Z_k$ is a limit in probability (or in the m.s. or
a.s. senses) of a sequence from $(Y_j : j \in J)$. Then $(Y_j : j \in J)$ and $(Z_k : k \in \mathcal{K})$ each have a
joint Gaussian distribution.

(c') (Alternative version of (c)) Suppose $(X_i : i \in I)$ has a joint Gaussian distribution. Let $\mathcal{Z}$
denote the smallest set of random variables that contains $(X_i : i \in I)$, is a linear class, and
is closed under taking limits in probability. Then $\mathcal{Z}$ has a joint Gaussian distribution.

(d) The characteristic function of a $N(\mu, K)$ random vector is given by
$\Phi_X(u) = E[e^{ju^T X}] = e^{ju^T \mu - \frac{1}{2} u^T K u}$.

(e) If X is a $N(\mu, K)$ random vector and K is a diagonal matrix (i.e. $\mathrm{Cov}(X_i, X_j) = 0$ for $i \ne j$,
or equivalently, the coordinates of X are uncorrelated) then the coordinates $X_1, \ldots, X_m$ are
independent.

(f) A $N(\mu, K)$ random vector X such that K is nonsingular has a pdf given by

$$f_X(x) = \frac{1}{(2\pi)^{m/2} |K|^{1/2}} \exp\left( -\frac{(x - \mu)^T K^{-1} (x - \mu)}{2} \right). \tag{3.8}$$

Any random vector X such that $\mathrm{Cov}(X)$ is singular does not have a pdf.

(g) If X and Y are jointly Gaussian vectors, then they are independent if and only if $\mathrm{Cov}(X, Y) = 0$.

Proof. (a) Suppose $(X_i : i \in I)$ has a joint Gaussian distribution, so that all finite linear
combinations of the $X_i$'s are Gaussian random variables. Each $X_i$ for $i \in I$ is itself a finite
linear combination of all the variables (with only one term). So each $X_i$ is a Gaussian random variable.

(b) Suppose the variables $X_i : i \in I$ are mutually independent, and each is Gaussian. Then any
finite linear combination of $(X_i : i \in I)$ is the sum of finitely many independent Gaussian random
variables, and is hence, by Lemma 3.4.1, also a Gaussian random variable. So $(X_i : i \in I)$ has a
joint Gaussian distribution.

(c) Suppose the hypotheses of (c) are true. Let V be a finite linear combination of $(Y_j : j \in J)$:
$V = b_1 Y_{j_1} + b_2 Y_{j_2} + \cdots + b_n Y_{j_n}$. Each $Y_j$ is a finite linear combination of $(X_i : i \in I)$, so V can be
written as a finite linear combination of $(X_i : i \in I)$:

$$V = b_1 (a_{11} X_{i_{11}} + a_{12} X_{i_{12}} + \cdots + a_{1k_1} X_{i_{1k_1}}) + \cdots + b_n (a_{n1} X_{i_{n1}} + \cdots + a_{nk_n} X_{i_{nk_n}}).$$

Therefore V is a Gaussian random variable. Thus, any finite linear combination of $(Y_j : j \in J)$
is Gaussian, so that $(Y_j : j \in J)$ has a joint Gaussian distribution.

Let W be a finite linear combination of $(Z_k : k \in \mathcal{K})$: $W = a_1 Z_{k_1} + \cdots + a_m Z_{k_m}$. By assumption,
for $1 \le l \le m$, there is a sequence $(j_{l,n} : n \ge 1)$ of indices from J such that $Y_{j_{l,n}} \to Z_{k_l}$ in probability
as $n \to \infty$. Let $W_n = a_1 Y_{j_{1,n}} + \cdots + a_m Y_{j_{m,n}}$. Each $W_n$ is a Gaussian random variable, because it is a finite
linear combination of $(Y_j : j \in J)$. Also,

$$|W - W_n| \le \sum_{l=1}^{m} |a_l|\, |Z_{k_l} - Y_{j_{l,n}}|. \tag{3.9}$$

Since each term on the right-hand side of (3.9) converges to zero in probability, it follows that
$W_n \to W$ in probability as $n \to \infty$. Since limits in probability of Gaussian random variables are also Gaussian
random variables (Proposition 2.1.16), it follows that W is a Gaussian random variable. Thus, an
arbitrary finite linear combination W of $(Z_k : k \in \mathcal{K})$ is Gaussian, so, by definition, $(Z_k : k \in \mathcal{K})$
has a joint Gaussian distribution.

(c') Suppose $(X_i : i \in I)$ has a joint Gaussian distribution. Using the notation of part (c), let
$(Y_j : j \in J)$ denote the set of all finite linear combinations of $(X_i : i \in I)$ and let $(Z_k : k \in \mathcal{K})$ denote
the set of all random variables that are limits in probability of random variables in $(Y_j : j \in J)$. We
will show that $\mathcal{Z} = (Z_k : k \in \mathcal{K})$, which together with part (c) already proved, will establish (c').
We begin by establishing that $(Z_k : k \in \mathcal{K})$ satisfies the three properties required of $\mathcal{Z}$:

(i) $(Z_k : k \in \mathcal{K})$ contains $(X_i : i \in I)$.

(ii) $(Z_k : k \in \mathcal{K})$ is a linear class.

(iii) $(Z_k : k \in \mathcal{K})$ is closed under taking limits in probability.

Property (i) follows from the fact that for any $i_0 \in I$, the random variable $X_{i_0}$ is trivially a finite
linear combination of $(X_i : i \in I)$, and it is trivially the limit in probability of the sequence with all
entries equal to itself. Property (ii) is true because a linear combination of the form $a_1 Z_{k_1} + a_2 Z_{k_2}$
is the limit in probability of a sequence of random variables of the form $a_1 Y_{j_{n,1}} + a_2 Y_{j_{n,2}}$, and,
since $(Y_j : j \in J)$ is a linear class, $a_1 Y_{j_{n,1}} + a_2 Y_{j_{n,2}}$ is a random variable from $(Y_j : j \in J)$ for
each n. To prove (iii), suppose $Z_{k_n} \to Z_\infty$ in probability as $n \to \infty$ for some sequence $k_1, k_2, \ldots$ from $\mathcal{K}$. By
passing to a subsequence if necessary, it can be assumed that $P\{|Z_\infty - Z_{k_n}| \ge 2^{-(n+1)}\} \le 2^{-(n+1)}$
for all $n \ge 1$. Since each $Z_{k_n}$ is the limit in probability of a sequence of random variables from
$(Y_j : j \in J)$, for each n there is a $j_n \in J$ so that $P\{|Z_{k_n} - Y_{j_n}| \ge 2^{-(n+1)}\} \le 2^{-(n+1)}$. Since
$|Z_\infty - Y_{j_n}| \le |Z_\infty - Z_{k_n}| + |Z_{k_n} - Y_{j_n}|$, it follows that $P\{|Z_\infty - Y_{j_n}| \ge 2^{-n}\} \le 2^{-n}$. So $Y_{j_n} \to Z_\infty$ in probability.
Therefore, $Z_\infty$ is a random variable in $(Z_k : k \in \mathcal{K})$, so $(Z_k : k \in \mathcal{K})$ is closed under convergence
in probability. In summary, $(Z_k : k \in \mathcal{K})$ has properties (i)-(iii). Any set of random variables
with these three properties must contain $(Y_j : j \in J)$, and hence must contain $(Z_k : k \in \mathcal{K})$.
So $(Z_k : k \in \mathcal{K})$ is indeed the smallest set of random variables with properties (i)-(iii). That is,
$(Z_k : k \in \mathcal{K}) = \mathcal{Z}$, as claimed.

(d) Let X be a $N(\mu, K)$ random vector. Then for any vector u with the same dimension as X,
the random variable $u^T X$ is Gaussian with mean $u^T \mu$ and variance given by

$$\mathrm{Var}(u^T X) = \mathrm{Cov}(u^T X, u^T X) = u^T K u.$$

Thus, we already know the characteristic function of $u^T X$. But the characteristic function of the
vector X evaluated at u is the characteristic function of $u^T X$ evaluated at 1:

$$\Phi_X(u) = E[e^{ju^T X}] = E[e^{j(u^T X)}] = \Phi_{u^T X}(1) = e^{ju^T \mu - \frac{1}{2} u^T K u},$$

which establishes part (d) of the proposition.

(e) If X is a $N(\mu, K)$ random vector and K is a diagonal matrix, then

$$\Phi_X(u) = \prod_{i=1}^{m} \exp\left( j u_i \mu_i - \frac{k_{ii} u_i^2}{2} \right) = \prod_{i=1}^{m} \Phi_i(u_i)$$

where $k_{ii}$ denotes the $i$th diagonal element of K, and $\Phi_i$ is the characteristic function of a $N(\mu_i, k_{ii})$
random variable. By uniqueness of joint characteristic functions, it follows that $X_1, \ldots, X_m$ are
independent random variables.

(f) Let X be a $N(\mu, K)$ random vector. Since K is positive semidefinite it can be written as
$K = U \Lambda U^T$ where U is orthonormal (so $UU^T = U^T U = I$) and $\Lambda$ is a diagonal matrix with the
nonnegative eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$ of K along the diagonal. (See Section 11.7 of the appendix.)
Let $Y = U^T (X - \mu)$. Then Y is a Gaussian vector with mean 0 and covariance matrix given by
$\mathrm{Cov}(Y, Y) = \mathrm{Cov}(U^T X, U^T X) = U^T K U = \Lambda$. In summary, we have $X = UY + \mu$, and Y is a
vector of independent Gaussian random variables, the $i$th one being $N(0, \lambda_i)$. Suppose further that
K is nonsingular, meaning $\det(K) \ne 0$. Since $\det(K) = \lambda_1 \lambda_2 \cdots \lambda_m$ this implies that $\lambda_i > 0$ for
each i, so that Y has the joint pdf

$$f_Y(y) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\lambda_i}} \exp\left( -\frac{y_i^2}{2\lambda_i} \right) = \frac{1}{(2\pi)^{m/2} \sqrt{\det(K)}} \exp\left( -\frac{y^T \Lambda^{-1} y}{2} \right).$$

Since $|\det(U)| = 1$ and $U \Lambda^{-1} U^T = K^{-1}$, the joint pdf for the $N(\mu, K)$ random vector X is given
by

$$f_X(x) = f_Y(U^T (x - \mu)) = \frac{1}{(2\pi)^{m/2} |K|^{1/2}} \exp\left( -\frac{(x - \mu)^T K^{-1} (x - \mu)}{2} \right).$$

Now suppose, instead, that X is any random vector with some mean $\mu$ and a singular covariance
matrix K. That means that $\det K = 0$, or equivalently that $\lambda_i = 0$ for one of the eigenvalues of K, or
equivalently, that there is a nonzero vector a such that $a^T K a = 0$ (such an a is an eigenvector of K for
eigenvalue zero). But then $0 = a^T K a = a^T \mathrm{Cov}(X, X) a = \mathrm{Cov}(a^T X, a^T X) = \mathrm{Var}(a^T X)$. Therefore,
$P\{a^T X = a^T \mu\} = 1$. That is, with probability one, X is in the subspace $\{x \in \mathbb{R}^m : a^T (x - \mu) = 0\}$.
Therefore, X does not have a pdf.
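
For m = 2 the density (3.8) can be written out with the closed-form inverse of a 2 x 2 matrix. The sketch below does this and compares it with the product of two one-dimensional Gaussian densities when K is diagonal, which is what parts (e) and (f) together predict. The function names and the numerical values are illustrative assumptions.

```python
import math

def normal_pdf(x, mean, var):
    """Scalar N(mean, var) density, var > 0."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gaussian_pdf_2d(x, mu, K):
    """Evaluate (3.8) for m = 2, assuming K is symmetric and nonsingular."""
    (a, b), (c, d) = K
    det = a * d - b * c
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T K^{-1} (x - mu), using K^{-1} = [[d, -b], [-c, a]] / det.
    quad = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    return math.exp(-quad / 2) / (2 * math.pi * math.sqrt(det))

# Hypothetical parameters with a diagonal covariance matrix.
mu = (1.0, -2.0)
K_diag = ((2.0, 0.0), (0.0, 0.5))
point = (0.3, -1.1)

joint = gaussian_pdf_2d(point, mu, K_diag)
product = normal_pdf(point[0], mu[0], 2.0) * normal_pdf(point[1], mu[1], 0.5)
assert math.isclose(joint, product, rel_tol=1e-12)  # uncorrelated coordinates factor
```

If K were singular, `det` would be zero and the formula would break down, in line with the statement that no pdf exists in that case.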

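(g) Suppose $X$ and $Y$ are jointly Gaussian vectors and uncorrelated (so $\mathrm{Cov}(X, Y) = 0$). Let $Z$ denote the dimension $m + n$ vector with coordinates $X_1, \ldots, X_m, Y_1, \ldots, Y_n$. Since $\mathrm{Cov}(X, Y) = 0$, the covariance matrix of $Z$ is block diagonal:

$$\mathrm{Cov}(Z) = \begin{pmatrix} \mathrm{Cov}(X) & 0 \\ 0 & \mathrm{Cov}(Y) \end{pmatrix}.$$

Therefore, for $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^n$,

$$\Phi_Z\left(\begin{pmatrix} u \\ v \end{pmatrix}\right) = \exp\left( j \begin{pmatrix} u \\ v \end{pmatrix}^T \begin{pmatrix} EX \\ EY \end{pmatrix} - \frac{1}{2} \begin{pmatrix} u \\ v \end{pmatrix}^T \mathrm{Cov}(Z) \begin{pmatrix} u \\ v \end{pmatrix} \right) = \Phi_X(u)\Phi_Y(v).$$

Such factorization implies that $X$ and $Y$ are independent. The if part of part (g) is proved. Conversely, if $X$ and $Y$ are jointly Gaussian and independent of each other, then the characteristic function of the joint density must factor, which implies that $\mathrm{Cov}(Z)$ is block diagonal as above. That is, $\mathrm{Cov}(X, Y) = 0$. ■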
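The decomposition $X = UY + \mu$ used in part (f) can be tried numerically. The following is a minimal sketch, assuming NumPy is available; the function names `diagonalize_gaussian` and `sample_gaussian` and the example values of `mu` and `K` are illustrative choices, not part of the notes.

```python
import numpy as np

def diagonalize_gaussian(K):
    """Return (U, lam) with K = U diag(lam) U^T and U orthonormal.

    Then Y = U^T (X - mu) has covariance diag(lam), as in part (f).
    """
    K = np.asarray(K, dtype=float)
    lam, U = np.linalg.eigh(K)      # eigh handles symmetric matrices
    lam = np.clip(lam, 0.0, None)   # remove tiny negative round-off values
    return U, lam

def sample_gaussian(mu, K, n, rng):
    """Draw n samples of X = U Y + mu with independent Y_i ~ N(0, lam_i)."""
    U, lam = diagonalize_gaussian(K)
    Y = rng.standard_normal((n, len(lam))) * np.sqrt(lam)
    return Y @ U.T + np.asarray(mu, dtype=float)  # each row is one sample of X

# Illustrative values (assumed, not from the notes).
mu = np.array([1.0, -1.0])
K = np.array([[2.0, 1.0], [1.0, 2.0]])
U, lam = diagonalize_gaussian(K)
X = sample_gaussian(mu, K, 5, np.random.default_rng(0))
```

This construction also works when $K$ is singular, since zero eigenvalues simply give coordinates of $Y$ that are constant at zero.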
Recall that in general, if $X$ and $Y$ are two random vectors on the same probability space, then the mean square error for the MMSE linear estimator $\widehat{E}[X|Y]$ is greater than or equal to the mean square error for the best unconstrained estimator, $E[X|Y]$. The tradeoff, however, is that $E[X|Y]$ can be much more difficult to compute than $\widehat{E}[X|Y]$, which is determined entirely by first and second moments. As shown in the next proposition, if $X$ and $Y$ are jointly Gaussian, the two estimators coincide. That is, the MMSE unconstrained estimator of $X$ given $Y$ is linear. We also know that $E[X|Y = y]$ is the mean of the conditional distribution of $X$ given $Y = y$. The proposition identifies not only the conditional mean, but the entire conditional distribution of $X$ given $Y = y$, for the case that $X$ and $Y$ are jointly Gaussian.

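Proposition 3.4.4 Let $X$ and $Y$ be jointly Gaussian vectors. Given $Y = y$, the conditional distribution of $X$ is $N(\widehat{E}[X|Y = y], \mathrm{Cov}(e))$, where $e = X - \widehat{E}[X|Y]$. In particular, the conditional mean $E[X|Y = y]$ is equal to $\widehat{E}[X|Y = y]$. That is, if $X$ and $Y$ are jointly Gaussian, then $E[X|Y] = \widehat{E}[X|Y]$.
If $\mathrm{Cov}(Y)$ is nonsingular,

$$E[X|Y = y] = \widehat{E}[X|Y = y] = EX + \mathrm{Cov}(X,Y)\mathrm{Cov}(Y)^{-1}(y - E[Y]) \tag{3.10}$$

$$\mathrm{Cov}(e) = \mathrm{Cov}(X) - \mathrm{Cov}(X,Y)\mathrm{Cov}(Y)^{-1}\mathrm{Cov}(Y,X), \tag{3.11}$$

and if $\mathrm{Cov}(e)$ is nonsingular,

$$f_{X|Y}(x|y) = \frac{1}{(2\pi)^{m/2}|\mathrm{Cov}(e)|^{1/2}} \exp\left(-\frac{1}{2}\left(x - \widehat{E}[X|Y = y]\right)^T \mathrm{Cov}(e)^{-1}\left(x - \widehat{E}[X|Y = y]\right)\right). \tag{3.12}$$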

Proof. Consider the MMSE linear estimator $\widehat{E}[X|Y]$ of $X$ given $Y$, and let $e$ denote the corresponding error vector: $e = X - \widehat{E}[X|Y]$. Recall that, by the orthogonality principle, $Ee = 0$ and $\mathrm{Cov}(e, Y) = 0$. Since $Y$ and $e$ are obtained from $X$ and $Y$ by linear transformations, they are jointly Gaussian. Since $\mathrm{Cov}(e, Y) = 0$, the random vectors $e$ and $Y$ are also independent. For the next part of the proof, the reader should keep in mind that if $a$ is a deterministic vector of some dimension $m$, and $Z$ is a $N(0, K)$ random vector, for a matrix $K$ that is not a function of $a$, then $Z + a$ has the $N(a, K)$ distribution.

Focus on the following rearrangement of the definition of $e$:

$$X = e + \widehat{E}[X|Y]. \tag{3.13}$$

(Basically, the whole proof of the proposition hinges on (3.13).) Since $\widehat{E}[X|Y]$ is a function of $Y$ and since $e$ is independent of $Y$ with distribution $N(0, \mathrm{Cov}(e))$, the following key observation can be made. Given $Y = y$, the conditional distribution of $e$ is the $N(0, \mathrm{Cov}(e))$ distribution, which does not depend on $y$, while $\widehat{E}[X|Y = y]$ is completely determined by $y$. So, given $Y = y$, $X$ can be viewed as the sum of the $N(0, \mathrm{Cov}(e))$ vector $e$ and the determined vector $\widehat{E}[X|Y = y]$. So the conditional distribution of $X$ given $Y = y$ is $N(\widehat{E}[X|Y = y], \mathrm{Cov}(e))$. In particular, $E[X|Y = y]$, which in general is the mean of the conditional distribution of $X$ given $Y = y$, is therefore the mean of the $N(\widehat{E}[X|Y = y], \mathrm{Cov}(e))$ distribution. Hence $E[X|Y = y] = \widehat{E}[X|Y = y]$. Since this is true for all $y$, $E[X|Y] = \widehat{E}[X|Y]$.

Equations (3.10) and (3.11), respectively, are just the equations (3.6) and (3.7) derived for the MMSE linear estimator, $\widehat{E}[X|Y]$, and its associated covariance of error. Equation (3.12) is just the formula (3.8) for the pdf of a $N(\mu, K)$ vector, with $\mu = \widehat{E}[X|Y = y]$ and $K = \mathrm{Cov}(e)$.
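Equations (3.10) and (3.11) translate directly into a few lines of linear algebra. The sketch below assumes NumPy; the function name `gaussian_conditional`, its argument names, and the numerical values in the example are illustrative assumptions rather than anything defined in the notes.

```python
import numpy as np

def gaussian_conditional(mean_x, mean_y, cov_xx, cov_xy, cov_yy, y):
    """Mean and covariance of X given Y = y for jointly Gaussian X, Y.

    Implements (3.10) and (3.11); assumes Cov(Y) is nonsingular.
    """
    # gain = Cov(X,Y) Cov(Y)^{-1}, computed with a linear solve
    gain = np.linalg.solve(cov_yy.T, cov_xy.T).T
    cond_mean = mean_x + gain @ (y - mean_y)   # (3.10)
    cond_cov = cov_xx - gain @ cov_xy.T        # (3.11), using Cov(Y,X) = Cov(X,Y)^T
    return cond_mean, cond_cov

# Illustrative scalar case (assumed values): Var(X) = 4, Cov(X,Y) = 3, Var(Y) = 9.
cond_mean, cond_cov = gaussian_conditional(
    mean_x=np.array([0.0]), mean_y=np.array([0.0]),
    cov_xx=np.array([[4.0]]), cov_xy=np.array([[3.0]]), cov_yy=np.array([[9.0]]),
    y=np.array([6.0]),
)
```

Note that the conditional covariance does not depend on the observed value `y`, in agreement with the proposition.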



Example 3.4.5 Suppose $X$ and $Y$ are jointly Gaussian mean zero random variables such that the vector $\begin{pmatrix} X \\ Y \end{pmatrix}$ has covariance matrix $\begin{pmatrix} 4 & 3 \\ 3 & 9 \end{pmatrix}$. Let us find simple expressions for the two random variables $E[X^2|Y]$ and $P(X \geq c|Y)$. Note that if $W$ is a random variable with the $N(\mu, \sigma^2)$ distribution, then $E[W^2] = \mu^2 + \sigma^2$ and $P\{W \geq c\} = Q\left(\frac{c - \mu}{\sigma}\right)$, where $Q$ is the standard Gaussian complementary CDF. The idea is to apply these facts to the conditional distribution of $X$ given $Y$.

Given $Y = y$, the conditional distribution of $X$ is $N\left(\frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(Y)}\, y,\ \mathrm{Cov}(X) - \frac{\mathrm{Cov}(X,Y)^2}{\mathrm{Var}(Y)}\right)$, or $N\left(\frac{y}{3}, 3\right)$. Therefore, $E[X^2|Y = y] = \left(\frac{y}{3}\right)^2 + 3$ and $P(X \geq c|Y = y) = Q\left(\frac{c - (y/3)}{\sqrt{3}}\right)$. Applying these two functions to the random variable $Y$ yields $E[X^2|Y] = \left(\frac{Y}{3}\right)^2 + 3$ and $P(X \geq c|Y) = Q\left(\frac{c - (Y/3)}{\sqrt{3}}\right)$.



3.5 Linear Innovations Sequences 

Let $X, Y_1, \ldots, Y_n$ be random vectors with finite second moments, all on the same probability space. In general, computation of the joint projection $\widehat{E}[X|Y_1, \ldots, Y_n]$ is considerably more complicated than computation of the individual projections $\widehat{E}[X|Y_i]$, because it requires inversion of the covariance matrix of all the $Y$'s. However, if $E[Y_i] = 0$ for all $i$ and $E[Y_iY_j^T] = 0$ for $i \neq j$ (i.e., all coordinates of $Y_i$ are orthogonal to constants and to all coordinates of $Y_j$ for $i \neq j$), then

$$\widehat{E}[X|Y_1, \ldots, Y_n] = \bar{X} + \sum_{i=1}^n \widehat{E}[X - \bar{X}|Y_i], \tag{3.14}$$

where we write $\bar{X}$ for $EX$. The orthogonality principle can be used to prove (3.14) as follows. It suffices to prove that the right side of (3.14) satisfies the two properties that together characterize the left side of (3.14). First, the right side is a linear combination of $1, Y_1, \ldots, Y_n$. Secondly, let $e$ denote the error when the right side of (3.14) is used to estimate $X$:

$$e = X - \bar{X} - \sum_{i=1}^n \widehat{E}[X - \bar{X}|Y_i].$$

It must be shown that $E[e(Y_1^Tc_1 + Y_2^Tc_2 + \cdots + Y_n^Tc_n + b)] = 0$ for any constant vectors $c_1, \ldots, c_n$ and constant $b$. It is enough to show that $E[e] = 0$ and $E[eY_j^T] = 0$ for all $j$. But $\widehat{E}[X - \bar{X}|Y_i]$ has the form $B_iY_i$, because $X - \bar{X}$ and $Y_i$ have mean zero. Thus, $E[e] = 0$. Furthermore,

$$E[eY_j^T] = E\left[\left(X - \widehat{E}[X|Y_j]\right)Y_j^T\right] - \sum_{i : i \neq j} E[B_iY_iY_j^T].$$

Each term on the right side of this equation is zero, so $E[eY_j^T] = 0$, and (3.14) is proved.

If $1, Y_1, Y_2, \ldots, Y_n$ have finite second moments but are not orthogonal, then (3.14) doesn't directly apply. However, by orthogonalizing this sequence we can obtain a sequence $1, \tilde{Y}_1, \tilde{Y}_2, \ldots, \tilde{Y}_n$ that can be used instead. Let $\tilde{Y}_1 = Y_1 - E[Y_1]$, and for $k \geq 2$ let

$$\tilde{Y}_k = Y_k - \widehat{E}[Y_k|Y_1, \ldots, Y_{k-1}]. \tag{3.15}$$

Then $E[\tilde{Y}_i] = 0$ for all $i$ and $E[\tilde{Y}_i\tilde{Y}_j^T] = 0$ for $i \neq j$. In addition, by induction on $k$, we can prove that the set of all random variables obtained by linear transformation of $1, \tilde{Y}_1, \ldots, \tilde{Y}_k$ is equal to the set of all random variables obtained by linear transformation of $1, Y_1, \ldots, Y_k$.
Thus, for any random variable $X$ with finite second moments,

$$\widehat{E}[X|Y_1, \ldots, Y_n] = \widehat{E}[X|\tilde{Y}_1, \ldots, \tilde{Y}_n] = \bar{X} + \sum_{i=1}^n \widehat{E}[X - \bar{X}|\tilde{Y}_i] = \bar{X} + \sum_{i=1}^n \frac{E[X\tilde{Y}_i]\,\tilde{Y}_i}{E[\tilde{Y}_i^2]}.$$

Moreover, this same result can be used to compute the innovations sequence recursively: $\tilde{Y}_1 = Y_1 - E[Y_1]$, and

$$\tilde{Y}_k = Y_k - E[Y_k] - \sum_{i=1}^{k-1} \frac{E[Y_k\tilde{Y}_i]\,\tilde{Y}_i}{E[\tilde{Y}_i^2]}, \qquad k \geq 2.$$

The sequence $\tilde{Y}_1, \tilde{Y}_2, \ldots, \tilde{Y}_n$ is called the linear innovations sequence for $Y_1, Y_2, \ldots, Y_n$.
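For mean zero scalar random variables, the recursion above depends only on the correlation matrix of $(Y_1, \ldots, Y_n)$, so it can be carried out on matrices. The following is a minimal sketch, assuming NumPy, mean zero scalar $Y_i$, and a nonsingular correlation matrix; the function name `innovations_matrix` and the example matrix `R` are illustrative assumptions.

```python
import numpy as np

def innovations_matrix(R):
    """Return unit lower-triangular A such that Ytilde = A @ Y.

    R is the correlation matrix E[Y Y^T] of mean zero scalars Y_1..Y_n.
    Row k of A holds the coefficients of Ytilde_k in terms of Y_1..Y_k.
    Assumes R is nonsingular, so that every E[Ytilde_i^2] is positive.
    """
    R = np.asarray(R, dtype=float)
    n = R.shape[0]
    A = np.zeros((n, n))
    for k in range(n):
        A[k, k] = 1.0
        for i in range(k):
            num = R[k] @ A[i]          # E[Y_k Ytilde_i]
            den = A[i] @ R @ A[i]      # E[Ytilde_i^2]
            A[k] -= (num / den) * A[i]
    return A

# Illustrative correlation matrix (assumed values).
R = np.array([[4.0, 2.0, 1.0],
              [2.0, 3.0, 1.0],
              [1.0, 1.0, 2.0]])
A = innovations_matrix(R)
D = A @ R @ A.T   # correlation matrix of the innovations; should be diagonal
```

The diagonal entries of `D` are the second moments $E[\tilde{Y}_i^2]$ that appear in the denominators of the recursion.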

3.6 Discrete-time Kalman filtering 

Kalman filtering is a state-space approach to the problem of estimating one random sequence 
from another. Recursive equations are found that are useful in many real-time applications. For 
notational convenience, because there are so many matrices in this section, lower case letters are 
used for random vectors. All the random variables involved are assumed to have finite second 
moments. The state sequence $x_0, x_1, \ldots$ is to be estimated from an observed sequence $y_0, y_1, \ldots$. These sequences of random vectors are assumed to satisfy the following state and observation equations.

State: $x_{k+1} = F_kx_k + w_k$, $k \geq 0$
Observation: $y_k = H_k^Tx_k + v_k$, $k \geq 0$.

It is assumed that

• $x_0, v_0, v_1, \ldots, w_0, w_1, \ldots$ are pairwise uncorrelated.

• $Ex_0 = \bar{x}_0$, $\mathrm{Cov}(x_0) = P_0$, $Ew_k = 0$, $\mathrm{Cov}(w_k) = Q_k$, $Ev_k = 0$, $\mathrm{Cov}(v_k) = R_k$.

• $F_k$, $H_k$, $Q_k$, $R_k$ for $k \geq 0$; $P_0$ are known matrices.

• $\bar{x}_0$ is a known vector.
See Figure 3.2 for a block diagram of the state and observation equations. The evolution of the state sequence $x_0, x_1, \ldots$ is driven by the random vectors $w_0, w_1, \ldots$, while the random vectors $v_0, v_1, \ldots$ represent observation noise.

Figure 3.2: Block diagram of the state and observations equations.

Let $\bar{x}_k = E[x_k]$ and $P_k = \mathrm{Cov}(x_k)$. These quantities are recursively determined for $k \geq 1$ by

$$\bar{x}_{k+1} = F_k\bar{x}_k \quad \text{and} \quad P_{k+1} = F_kP_kF_k^T + Q_k, \tag{3.16}$$

where the initial conditions $\bar{x}_0$ and $P_0$ are given as part of the state model. The idea of the Kalman filter equations is to recursively compute conditional expectations in a similar way.

Let $y^k = (y_0, y_1, \ldots, y_k)$ represent the observations up to time $k$. Define for nonnegative integers $i, j$

$$\widehat{x}_{i|j} = \widehat{E}[x_i|y^j]$$

and the associated covariance of error matrices

$$\Sigma_{i|j} = \mathrm{Cov}(x_i - \widehat{x}_{i|j}).$$

The goal is to compute $\widehat{x}_{k+1|k}$ for $k \geq 0$. The Kalman filter equations will first be stated, then briefly discussed, and then derived. The Kalman filter equations are given by

$$\begin{aligned} \widehat{x}_{k+1|k} &= \left[F_k - K_kH_k^T\right]\widehat{x}_{k|k-1} + K_ky_k \\ &= F_k\widehat{x}_{k|k-1} + K_k\left[y_k - H_k^T\widehat{x}_{k|k-1}\right] \end{aligned} \tag{3.17}$$

with the initial condition $\widehat{x}_{0|-1} = \bar{x}_0$, where the gain matrix $K_k$ is given by

$$K_k = F_k\Sigma_{k|k-1}H_k\left[H_k^T\Sigma_{k|k-1}H_k + R_k\right]^{-1} \tag{3.18}$$

and the covariance of error matrices are recursively computed by

$$\Sigma_{k+1|k} = F_k\left[\Sigma_{k|k-1} - \Sigma_{k|k-1}H_k\left(H_k^T\Sigma_{k|k-1}H_k + R_k\right)^{-1}H_k^T\Sigma_{k|k-1}\right]F_k^T + Q_k \tag{3.19}$$

with the initial condition $\Sigma_{0|-1} = P_0$. See Figure 3.3 for the block diagram.

Figure 3.3: Block diagram of the Kalman filter.



We comment briefly on the Kalman filter equations, before deriving them. First, observe what happens if $H_k$ is the zero matrix, $H_k = 0$, for all $k$. Then the Kalman filter equations reduce to (3.16) with $\widehat{x}_{k|k-1} = \bar{x}_k$, $\Sigma_{k|k-1} = P_k$ and $K_k = 0$. Taking $H_k = 0$ for all $k$ is equivalent to having no observations available.
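One step of the recursion (3.17)-(3.19) can be sketched in code as follows. This is a minimal illustration assuming NumPy and the convention of this section that the observation is $y_k = H_k^Tx_k + v_k$; the function name `kalman_step` and all numerical values are assumptions made for the example.

```python
import numpy as np

def kalman_step(x_pred, Sigma, y, F, H, Q, R):
    """Map (x_{k|k-1}, Sigma_{k|k-1}, y_k) to (x_{k+1|k}, Sigma_{k+1|k}, K_k).

    Follows (3.17), (3.18), (3.19) with observation model y_k = H^T x_k + v_k.
    """
    S_inv = np.linalg.inv(H.T @ Sigma @ H + R)       # inverse covariance of the new part of y_k
    K = F @ Sigma @ H @ S_inv                        # gain, (3.18)
    x_next = F @ x_pred + K @ (y - H.T @ x_pred)     # estimate update, (3.17)
    Sigma_next = F @ (Sigma - Sigma @ H @ S_inv @ H.T @ Sigma) @ F.T + Q  # (3.19)
    return x_next, Sigma_next, K

# Illustrative two-dimensional state with a scalar observation (assumed values).
F = np.array([[1.0, 1.0], [0.0, 1.0]])
Q = 0.1 * np.eye(2)
R = np.array([[1.0]])
x0 = np.array([0.0, 0.0])
P0 = np.eye(2)
y0 = np.array([2.0])

# With H = 0 the step reduces to the moment recursion (3.16).
H_zero = np.zeros((2, 1))
x1_blind, S1_blind, K_blind = kalman_step(x0, P0, y0, F, H_zero, Q, R)

# With H = (1, 0)^T the first state coordinate is observed in noise.
H = np.array([[1.0], [0.0]])
x1, S1, K0 = kalman_step(x0, P0, y0, F, H, Q, R)
```

Since the gain and covariance computations do not involve `y`, they could be run ahead of time, with only the estimate update performed as observations arrive.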

In many applications, the sequence of gain matrices can be computed ahead of time according to (3.18) and (3.19). Then as the observations become available, the estimates can be computed using only (3.17). In some applications the matrices involved in the state and observation models, including the covariance matrices of the $v_k$'s and $w_k$'s, do not depend on $k$. The gain matrices $K_k$ could still depend on $k$ due to the initial conditions, but if the model is stable in some sense, then the gains converge to a constant matrix $K$, so that in steady state the filter equation (3.17) becomes time invariant: $\widehat{x}_{k+1|k} = (F - KH^T)\widehat{x}_{k|k-1} + Ky_k$.

In other applications, particularly those involving feedback control, the matrices in the state 
and/or observation equations might not be known until just before they are needed. 

The Kalman filter equations are now derived. Roughly speaking, there are two considerations for computing $\widehat{x}_{k+1|k}$ once $\widehat{x}_{k|k-1}$ is computed: (1) the time update, accounting for the change in state from $x_k$ to $x_{k+1}$, and (2) the information update, accounting for the availability of the new observation $y_k$. Indeed, for $k \geq 0$, we can write

$$\widehat{x}_{k+1|k} = \widehat{x}_{k+1|k-1} + \left[\widehat{x}_{k+1|k} - \widehat{x}_{k+1|k-1}\right], \tag{3.20}$$

where the first term on the right of (3.20), namely $\widehat{x}_{k+1|k-1}$, represents the result of modifying $\widehat{x}_{k|k-1}$ to take into account the passage of time on the state, but with no new observation. The difference in square brackets in (3.20) is the contribution of the new observation, $y_k$, to the estimation of $x_{k+1}$.

Time update: In view of the state update equation and the fact that $w_k$ is uncorrelated with the random variables of $y^{k-1}$ and has mean zero,

$$\begin{aligned} \widehat{x}_{k+1|k-1} &= \widehat{E}[F_kx_k + w_k|y^{k-1}] \\ &= F_k\widehat{E}[x_k|y^{k-1}] + \widehat{E}[w_k|y^{k-1}] \\ &= F_k\widehat{x}_{k|k-1}. \end{aligned} \tag{3.21}$$

Thus, the time update consists of simply multiplying the previous estimate by $F_k$. If there were no new observation, then this would be the entire Kalman filter. Furthermore, the covariance of error matrix for predicting $x_{k+1}$ by $\widehat{x}_{k+1|k-1}$ is given by

$$\begin{aligned} \Sigma_{k+1|k-1} &= \mathrm{Cov}(x_{k+1} - \widehat{x}_{k+1|k-1}) \\ &= \mathrm{Cov}(F_k(x_k - \widehat{x}_{k|k-1}) + w_k) \\ &= F_k\Sigma_{k|k-1}F_k^T + Q_k. \end{aligned} \tag{3.22}$$

Information update: Consider next the new observation $y_k$. The observation $y_k$ is not totally new, for it can be predicted in part from the previous observations, or simply by its mean in the case $k = 0$. Specifically, we can consider $\tilde{y}_k = y_k - \widehat{E}[y_k|y^{k-1}]$ to be the new part of the observation $y_k$. Here, $\tilde{y}_0, \tilde{y}_1, \ldots$ is the linear innovation sequence for the observation sequence $y_0, y_1, \ldots$, as defined in Section 3.5 (with the minor difference that here the vectors are indexed from time $k = 0$ on, rather than from time $k = 1$). Since the linear span of the random variables in $(1, y^{k-1}, y_k)$ is the same as the linear span of the random variables in $(1, y^{k-1}, \tilde{y}_k)$, for the purposes of incorporating the new observation we can pretend that $\tilde{y}_k$ is the new observation rather than $y_k$.

By the observation equation and the facts $E[v_k] = 0$ and $E[y^{k-1}v_k^T] = 0$, it follows that

$$\widehat{E}[y_k|y^{k-1}] = \widehat{E}\left[H_k^Tx_k + v_k \,\middle|\, y^{k-1}\right] = H_k^T\widehat{x}_{k|k-1},$$

so $\tilde{y}_k = y_k - H_k^T\widehat{x}_{k|k-1}$. Since $(1, y^{k-1}, y_k)$ and $(1, \tilde{y}^{k-1}, \tilde{y}_k)$ have the same span and the random variables in $\tilde{y}^{k-1}$ are orthogonal to the random variables in $\tilde{y}_k$, and all these random variables have mean zero,

$$\begin{aligned} \widehat{x}_{k+1|k} &= \widehat{E}\left[x_{k+1} \,\middle|\, \tilde{y}^{k-1}, \tilde{y}_k\right] \\ &= \bar{x}_{k+1} + \widehat{E}\left[x_{k+1} - \bar{x}_{k+1} \,\middle|\, \tilde{y}^{k-1}\right] + \widehat{E}\left[x_{k+1} - \bar{x}_{k+1} \,\middle|\, \tilde{y}_k\right] \\ &= \widehat{x}_{k+1|k-1} + \widehat{E}\left[x_{k+1} - \bar{x}_{k+1} \,\middle|\, \tilde{y}_k\right]. \end{aligned}$$

Therefore, the term to be added for the information update (in square brackets in (3.20)) is $\widehat{E}[x_{k+1} - \bar{x}_{k+1}|\tilde{y}_k]$. Since $x_{k+1} - \bar{x}_{k+1}$ and $\tilde{y}_k$ both have mean zero, the information update term can be simplified to:

$$\widehat{E}[x_{k+1} - \bar{x}_{k+1}|\tilde{y}_k] = K_k\tilde{y}_k, \tag{3.23}$$

where (use the fact $\mathrm{Cov}(x_{k+1} - \bar{x}_{k+1}, \tilde{y}_k) = \mathrm{Cov}(x_{k+1}, \tilde{y}_k)$ because $E[\tilde{y}_k] = 0$)

$$K_k = \mathrm{Cov}(x_{k+1}, \tilde{y}_k)\mathrm{Cov}(\tilde{y}_k)^{-1}. \tag{3.24}$$



Putting it all together: Equation (3.20), with the time update equation (3.21) and the fact the information update term is $K_k\tilde{y}_k$, yields the main Kalman filter equation:

$$\widehat{x}_{k+1|k} = F_k\widehat{x}_{k|k-1} + K_k\tilde{y}_k. \tag{3.25}$$

Taking into account the new observation $\tilde{y}_k$, which is orthogonal to the previous observations, yields a reduction in the covariance of error:

$$\Sigma_{k+1|k} = \Sigma_{k+1|k-1} - \mathrm{Cov}(K_k\tilde{y}_k). \tag{3.26}$$

The Kalman filter equations (3.17), (3.18), and (3.19) follow easily from (3.25), (3.24), and (3.26), as follows. To convert (3.24) into (3.18), use

$$\begin{aligned} \mathrm{Cov}(x_{k+1}, \tilde{y}_k) &= \mathrm{Cov}(F_kx_k + w_k, H_k^T(x_k - \widehat{x}_{k|k-1}) + v_k) \\ &= \mathrm{Cov}(F_kx_k, H_k^T(x_k - \widehat{x}_{k|k-1})) \\ &= \mathrm{Cov}(F_k(x_k - \widehat{x}_{k|k-1}), H_k^T(x_k - \widehat{x}_{k|k-1})) \\ &= F_k\Sigma_{k|k-1}H_k \end{aligned} \tag{3.27}$$

and

$$\begin{aligned} \mathrm{Cov}(\tilde{y}_k) &= \mathrm{Cov}(H_k^T(x_k - \widehat{x}_{k|k-1}) + v_k) \\ &= \mathrm{Cov}(H_k^T(x_k - \widehat{x}_{k|k-1})) + \mathrm{Cov}(v_k) \\ &= H_k^T\Sigma_{k|k-1}H_k + R_k. \end{aligned} \tag{3.28}$$

To convert (3.26) into (3.19) use (3.22), (3.27), and (3.28) to get

$$\begin{aligned} \mathrm{Cov}(K_k\tilde{y}_k) &= K_k\mathrm{Cov}(\tilde{y}_k)K_k^T \\ &= \mathrm{Cov}(x_{k+1}, \tilde{y}_k)\mathrm{Cov}(\tilde{y}_k)^{-1}\mathrm{Cov}(\tilde{y}_k, x_{k+1}). \end{aligned}$$

This completes the derivation of the Kalman filtering equations.
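The final substitution can be written out in full. The following is a sketch of the algebra, using only (3.22), (3.26), (3.27), (3.28), and the symmetry of the covariance of error matrix (so that the transpose of (3.27) gives the cross covariance in the other order):

```latex
\begin{align*}
\mathrm{Cov}(K_k \tilde{y}_k)
  &= \mathrm{Cov}(x_{k+1}, \tilde{y}_k)\,\mathrm{Cov}(\tilde{y}_k)^{-1}\,\mathrm{Cov}(\tilde{y}_k, x_{k+1}) \\
  &= F_k \Sigma_{k|k-1} H_k \left[ H_k^T \Sigma_{k|k-1} H_k + R_k \right]^{-1} H_k^T \Sigma_{k|k-1} F_k^T, \\
\Sigma_{k+1|k}
  &= \Sigma_{k+1|k-1} - \mathrm{Cov}(K_k \tilde{y}_k) \\
  &= F_k \Sigma_{k|k-1} F_k^T + Q_k
     - F_k \Sigma_{k|k-1} H_k \left[ H_k^T \Sigma_{k|k-1} H_k + R_k \right]^{-1} H_k^T \Sigma_{k|k-1} F_k^T \\
  &= F_k \left[ \Sigma_{k|k-1}
     - \Sigma_{k|k-1} H_k \left( H_k^T \Sigma_{k|k-1} H_k + R_k \right)^{-1} H_k^T \Sigma_{k|k-1} \right] F_k^T + Q_k .
\end{align*}
```

The last line is the covariance recursion (3.19).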

3.7 Problems 

3.1 Rotation of a joint normal distribution yielding independence 

Let X be a Gaussian vector with 

^W=( 1 5 °) Cov(X)=(2 | 

(a) Write an expression for the pdf of X that does not use matrix notation. 

(b) Find a vector $b$ and orthonormal matrix $U$ such that the vector $Y$ defined by $Y = U^T(X - b)$ is a mean zero Gaussian vector such that $Y_1$ and $Y_2$ are independent.

3.2 Linear approximation of the cosine function over an interval

Let $\Theta$ be uniformly distributed on the interval $[0, \pi]$ (yes, $[0, \pi]$, not $[0, 2\pi]$). Suppose $Y = \cos(\Theta)$ is to be estimated by an estimator of the form $a + b\Theta$. What numerical values of $a$ and $b$ minimize the mean square error?




3.3 Calculation of some minimum mean square error estimators 

Let $Y = X + N$, where $X$ has the exponential distribution with parameter $\lambda$, and $N$ is Gaussian with mean $0$ and variance $\sigma^2$. The variables $X$ and $N$ are independent, and the parameters $\lambda$ and $\sigma^2$ are strictly positive. (Recall that $E[X] = \frac{1}{\lambda}$ and $\mathrm{Var}(X) = \frac{1}{\lambda^2}$.)

(a) Find $\widehat{E}[X|Y]$ and also find the mean square error for estimating $X$ by $\widehat{E}[X|Y]$.

(b) Does $E[X|Y] = \widehat{E}[X|Y]$? Justify your answer. (Hint: Answer is yes if and only if there is no estimator for $X$ of the form $g(Y)$ with a smaller MSE than $\widehat{E}[X|Y]$.)

3.4 Valid covariance matrix 

For what real values of $a$ and $b$ is the following matrix the covariance matrix of some real-valued random vector?

$$K = \begin{pmatrix} 2 & 1 & b \\ a & 1 & 0 \\ b & 0 & 1 \end{pmatrix}$$

Hint: A symmetric $n \times n$ matrix is positive semidefinite if and only if the determinant of every matrix obtained by deleting a set of rows and the corresponding set of columns is nonnegative.

3.5 Conditional probabilities with joint Gaussians I 

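Let $\begin{pmatrix} X \\ Y \end{pmatrix}$ be a mean zero Gaussian vector with correlation matrix $\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$, where $|\rho| < 1$.

(a) Express $P(X \leq 1|Y)$ in terms of $\rho$, $Y$, and the standard normal CDF, $\Phi$.

(b) Find $E[(X - Y)^2|Y = y]$ for real values of $y$.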

3.6 Conditional probabilities with joint Gaussians II 

Let X, Y be jointly Gaussian random variables with mean zero and covariance matrix 

Cov| Y ) = (g 1 6 8 

You may express your answers in terms of the $\Phi$ function defined by $\Phi(u) = \int_{-\infty}^u \frac{1}{\sqrt{2\pi}} e^{-s^2/2}\,ds$.

(a) Find $P\{|X - 1| \geq 2\}$.

(b) What is the conditional density of $X$ given that $Y = 3$? You can either write out the density in full, or describe it as a well known density with specified parameter values.

(c) Find $P\{|X - E[X|Y]| \geq 1\}$.



3.7 An estimation error bound 

$e = X - E[X|Y]$.

(a) If possible, compute $E[e^2]$. If not, give an upper bound.

(b) For what joint distribution of $X$ and $Y$ (consistent with the given information) is $E[e^2]$ maximized? Is your answer unique?

3.8 An MMSE estimation problem

(a) Let $X$ and $Y$ be jointly uniformly distributed over the triangular region in the $x$-$y$ plane with corners $(0,0)$, $(0,1)$, and $(1,2)$. Find both the linear minimum mean square error (LMMSE) estimator of $X$ given $Y$ and the (possibly nonlinear) MMSE estimator of $X$ given $Y$. Compute the mean square error for each estimator. What percentage reduction in MSE does the MMSE estimator provide over the LMMSE?

(b) Repeat part (a) assuming $Y$ is a $N(0, 1)$ random variable and $X = |Y|$.

3.9 Comparison of MMSE estimators for an example 

Let X = jj-jj, where U is uniformly distributed over the interval [0, 1]. 

(a) Find $E[X|U]$ and calculate the MSE, $E[(X - E[X|U])^2]$.

(b) Find $\widehat{E}[X|U]$ and calculate the MSE, $E[(X - \widehat{E}[X|U])^2]$.

3.10 Conditional Gaussian comparison

Suppose that $X$ and $Y$ are jointly Gaussian, mean zero, with $\mathrm{Var}(X) = \mathrm{Var}(Y) = 10$ and $\mathrm{Cov}(X, Y) = 8$. Express the following probabilities in terms of the $Q$ function.

(a) $p_a = P\{X \geq 2\}$.

(b) $p_b = P(X \geq 2|Y = 3)$.

(c) $p_c = P(X \geq 2|Y \geq 3)$. (Note: $p_c$ can be expressed as an integral. You need not carry out the integration.)

(d) Indicate how $p_a$, $p_b$, and $p_c$ are ordered, from smallest to largest.

3.11 Diagonalizing a two-dimensional Gaussian distribution 

Let $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$ be a mean zero Gaussian random vector with correlation matrix $\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$, where $|\rho| < 1$. Find an orthonormal 2 by 2 matrix $U$ such that $X = UY$ for a Gaussian vector $Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}$ such that $Y_1$ is independent of $Y_2$. Also, find the variances of $Y_1$ and $Y_2$.

Note: The following identity might be useful for some of the problems that follow. If $A$, $B$, $C$, and $D$ are jointly Gaussian and mean zero, then $E[ABCD] = E[AB]E[CD] + E[AC]E[BD] + E[AD]E[BC]$. This implies that $E[A^4] = 3E[A^2]^2$, $\mathrm{Var}(A^2) = 2E[A^2]^2$, and $\mathrm{Cov}(A^2, B^2) = 2\mathrm{Cov}(A, B)^2$. Also, $E[A^2B] = 0$.



3.12 An estimator of an estimator 

Let $X$ and $Y$ be square integrable random variables and let $Z = E[X|Y]$, so $Z$ is the MMSE estimator of $X$ given $Y$. Show that the LMMSE estimator of $X$ given $Y$ is also the LMMSE estimator of $Z$ given $Y$. (Can you generalize this result?)

3.13 Projections onto nested linear subspaces 

(a) Use the Orthogonality Principle to prove the following statement: Suppose $\mathcal{V}_0$ and $\mathcal{V}_1$ are two closed linear spaces of second order random variables, such that $\mathcal{V}_0 \supset \mathcal{V}_1$, and suppose $X$ is a random variable with finite second moment. Let $Z_i^*$ be the random variable in $\mathcal{V}_i$ with the minimum mean square distance from $X$. Then $Z_1^*$ is the variable in $\mathcal{V}_1$ with the minimum mean square distance from $Z_0^*$.

(b) Suppose that $X$, $Y_1$, and $Y_2$ are random variables with finite second moments. For each of the following three statements, identify the choice of subspace $\mathcal{V}_0$ and $\mathcal{V}_1$ such that the statement follows from part (a):

(i) $\widehat{E}[X|Y_1] = \widehat{E}[\,\widehat{E}[X|Y_1, Y_2]\,|Y_1]$.

(ii) $E[X|Y_1] = E[\,E[X|Y_1, Y_2]\,|Y_1]$. (Sometimes called the "tower property.")

(iii) $E[X] = E[\widehat{E}[X|Y_1]]$. (Think of the expectation of a random variable as the constant closest to the random variable, in the m.s. sense.)

3.14 Some identities for estimators 

Let $X$ and $Y$ be random variables with $E[X^2] < \infty$. For each of the following statements, determine if the statement is true. If yes, give a justification using the orthogonality principle. If no, give a counter example.

(a) $E[X\cos(Y)|Y] = E[X|Y]\cos(Y)$

(b) $E[X|Y] = E[X|Y^3]$

(c) $E[X^3|Y] = E[X|Y]^3$

(d) $E[X|Y] = E[X|Y^2]$

(e) $\widehat{E}[X|Y] = \widehat{E}[X|Y^3]$

3.15 Some identities for estimators, version 2 

Let X, Y, and Z be random variables with finite second moments and suppose X is to be estimated. 
For each of the following, if true, give a brief explanation. If false, give a counter example. 

(a) $E[(X - E[X|Y])^2] \leq E[(X - \widehat{E}[X|Y, Y^2])^2]$.

(b) $E[(X - E[X|Y])^2] = E[(X - \widehat{E}[X|Y, Y^2])^2]$ if $X$ and $Y$ are jointly Gaussian.

(c) $E[(X - E[E[X|Z]\,|Y])^2] \leq E[(X - E[X|Y])^2]$.

(d) If $E[(X - E[X|Y])^2] = \mathrm{Var}(X)$, then $X$ and $Y$ are independent.

3.16 Some simple examples 

Give an example of each of the following, and in each case, explain your reasoning. 

(a) Two random variables $X$ and $Y$ such that $\widehat{E}[X|Y] = E[X|Y]$, and such that $E[X|Y]$ is not simply constant, and $X$ and $Y$ are not jointly Gaussian.

(b) A pair of random variables X and Y on some probability space such that X is Gaussian, Y is 
Gaussian, but X and Y are not jointly Gaussian. 

(c) Three random variables X, Y, and Z, which are pairwise independent, but all three together are 
not independent. 



3.17 The square root of a positive-semidefinite matrix 

(a) True or false? If B is a matrix over the reals, then BB T is positive semidefinite. 

(b) True or false? If $K$ is a symmetric positive semidefinite matrix over the reals, then there exists a symmetric positive semidefinite matrix $S$ over the reals such that $K = S^2$. (Hint: What if $K$ is also diagonal?)

3.18 Estimating a quadratic

Let $\begin{pmatrix} X \\ Y \end{pmatrix}$ be a mean zero Gaussian vector with correlation matrix $\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$, where $|\rho| < 1$.

(a) Find $E[X^2|Y]$, the best estimator of $X^2$ given $Y$.

(b) Compute the mean square error for the estimator $E[X^2|Y]$.

(c) Find $\widehat{E}[X^2|Y]$, the best linear (actually, affine) estimator of $X^2$ given $Y$, and compute the mean square error.

3.19 A quadratic estimator 

Suppose $Y$ has the $N(0, 1)$ distribution and that $X = |Y|$. Find the estimator for $X$ of the form $\widehat{X} = a + bY + cY^2$ which minimizes the mean square error. (You can use the following numerical values: $E[|Y|] = 0.8$, $E[Y^4] = 3$, $E[|Y|Y^2] = 1.6$.)

(a) Use the orthogonality principle to derive equations for $a$, $b$, and $c$.

(b) Find the estimator $\widehat{X}$.

(c) Find the resulting minimum mean square error. 



3.20 An innovations sequence and its application 

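Let $\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ X \end{pmatrix}$ be a mean zero random vector with correlation matrix $\begin{pmatrix} 1 & 0.5 & 0.5 & 0 \\ 0.5 & 1 & 0.5 & 0.25 \\ 0.5 & 0.5 & 1 & 0.25 \\ 0 & 0.25 & 0.25 & 1 \end{pmatrix}$.

(a) Let $\tilde{Y}_1, \tilde{Y}_2, \tilde{Y}_3$ denote the innovations sequence. Find the matrix $A$ so that $\begin{pmatrix} \tilde{Y}_1 \\ \tilde{Y}_2 \\ \tilde{Y}_3 \end{pmatrix} = A \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix}$.

(b) Find the correlation matrix of $\begin{pmatrix} \tilde{Y}_1 \\ \tilde{Y}_2 \\ \tilde{Y}_3 \end{pmatrix}$ and the cross covariance matrix $\mathrm{Cov}\left(X, \begin{pmatrix} \tilde{Y}_1 \\ \tilde{Y}_2 \\ \tilde{Y}_3 \end{pmatrix}\right)$.

(c) Find the constants $a$, $b$, and $c$ to minimize $E[(X - a\tilde{Y}_1 - b\tilde{Y}_2 - c\tilde{Y}_3)^2]$.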

3.21 Estimation for an additive Gaussian noise model 

Assume $x$ and $n$ are independent Gaussian vectors with means $\bar{x}$, $\bar{n}$ and covariance matrices $\Sigma_x$ and $\Sigma_n$. Let $y = x + n$. Then $x$ and $y$ are jointly Gaussian.

(a) Show that $E[x|y]$ is given by either $\bar{x} + \Sigma_x(\Sigma_x + \Sigma_n)^{-1}(y - (\bar{x} + \bar{n}))$ or $\Sigma_n(\Sigma_x + \Sigma_n)^{-1}\bar{x} + \Sigma_x(\Sigma_x + \Sigma_n)^{-1}(y - \bar{n})$.

(b) Show that the conditional covariance matrix of $x$ given $y$ is given by any of the three expressions:

$$\Sigma_x - \Sigma_x(\Sigma_x + \Sigma_n)^{-1}\Sigma_x = \Sigma_x(\Sigma_x + \Sigma_n)^{-1}\Sigma_n = (\Sigma_x^{-1} + \Sigma_n^{-1})^{-1}.$$

(Assume that the various inverses exist.)

3.22 A Kalman filtering example 

(a) Let $\sigma^2 > 0$, let $f$ be a real constant, and let $x_0$ denote a $N(0, \sigma^2)$ random variable. Consider the state and observation sequences defined by:

(state) $x_{k+1} = fx_k + w_k$
(observation) $y_k = x_k + v_k$

where $w_1, w_2, \ldots; v_1, v_2, \ldots$ are mutually independent $N(0, 1)$ random variables. Write down the Kalman filter equations for recursively computing the estimates $\widehat{x}_{k|k-1}$, the (scalar) gains $K_k$, and the sequence of the variances of the errors (for brevity write $\sigma_k^2$ for the covariance of error instead of $\Sigma_{k|k-1}$).

(b) For what values of $f$ is the sequence of error variances bounded?

3.23 Steady state gains for one-dimensional Kalman filter

This is a continuation of the previous problem.

(a) Show that $\lim_{k \to \infty} \sigma_k^2$ exists.

(b) Express the limit, $\sigma_\infty^2$, in terms of $f$.

(c) Explain why $\sigma_\infty^2 = 1$ if $f = 0$.

3.24 A variation of Kalman filtering 

(a) Let $\sigma^2 > 0$, let $f$ be a real constant, and let $x_0$ denote a $N(0, \sigma^2)$ random variable. Consider the state and observation sequences defined by:

(state) $x_{k+1} = fx_k + w_k$
(observation) $y_k = x_k + w_k$

where $w_1, w_2, \ldots$ are mutually independent $N(0, 1)$ random variables. Note that the state and observation equations are driven by the same sequence, so that some of the Kalman filtering equations derived in the notes do not apply. Derive recursive equations needed to compute $\widehat{x}_{k|k-1}$, including recursive equations for any needed gains or variances of error. (Hints: What modifications need to be made to the derivation for the standard model? Check that your answer is correct for $f = 1$.)

3.25 The Kalman filter for $\widehat{x}_{k|k}$

Suppose in a given application a Kalman filter has been implemented to recursively produce $\widehat{x}_{k+1|k}$ for $k \geq 0$, as in class. Thus by time $k$, $\widehat{x}_{k+1|k}$, $\Sigma_{k+1|k}$, $\widehat{x}_{k|k-1}$, and $\Sigma_{k|k-1}$ are already computed. Suppose that it is desired to also compute $\widehat{x}_{k|k}$ at time $k$. Give additional equations that can be used to compute $\widehat{x}_{k|k}$. (You can assume as given the equations in the class notes, and don't need to write them all out. Only the additional equations are asked for here. Be as explicit as you can, expressing any matrices you use in terms of the matrices already given in the class notes.)

3.26 An innovations problem 

Let $U_1, U_2, \ldots$ be a sequence of independent random variables, each uniformly distributed on the interval $[0, 1]$. Let $Y_0 = 1$, and $Y_n = U_1U_2\cdots U_n$ for $n \geq 1$.

(a) Find the variance of $Y_n$ for each $n \geq 1$.

(b) Find $E[Y_n|Y_0, \ldots, Y_{n-1}]$ for $n \geq 1$.

(c) Find $\widehat{E}[Y_n|Y_0, \ldots, Y_{n-1}]$ for $n \geq 1$.

(d) Find the linear innovations sequence $\tilde{Y} = (\tilde{Y}_0, \tilde{Y}_1, \ldots)$.

(e) Fix a positive integer $M$ and let $X_M = U_1 + \cdots + U_M$. Using the answer to part (d), find $\widehat{E}[X_M|Y_0, \ldots, Y_M]$, the best linear estimator of $X_M$ given $(Y_0, \ldots, Y_M)$.

3.27 Linear innovations and orthogonal polynomials for the normal distribution 

(a) Let $X$ be a $N(0, 1)$ random variable. Show that for integers $n \geq 0$,

$$E[X^n] = \begin{cases} \dfrac{n!}{(n/2)!\,2^{n/2}} & n \text{ even} \\ 0 & n \text{ odd.} \end{cases}$$

Hint: One approach is to apply the power series expansion for $e^x$ on each side of the identity $E[e^{uX}] = e^{u^2/2}$, and identify the coefficients of $u^n$.

(b) Let $X$ be a $N(0, 1)$ random variable, and let $Y_n = X^n$ for integers $n \geq 1$. Express the first four terms, $\tilde{Y}_1$ through $\tilde{Y}_4$, of the linear innovations sequence of $Y$ in terms of $X$.

3.28 Linear innovations and orthogonal polynomials for the uniform distribution

(a) Let $U$ be uniformly distributed on the interval $[-1, 1]$. Show that for integers $n \geq 0$,

$$E[U^n] = \begin{cases} \dfrac{1}{n+1} & n \text{ even} \\ 0 & n \text{ odd.} \end{cases}$$

(b) Let $Y_n = U^n$ for integers $n \geq 0$. Note that $Y_0 \equiv 1$. Express the first four terms, $\tilde{Y}_1$ through $\tilde{Y}_4$, of the linear innovations sequence of $Y$ in terms of $U$.

3.29 Representation of three random variables with equal cross covariances 

Let $K$ be a matrix of the form

$$\begin{pmatrix} 1 & a & a \\ a & 1 & a \\ a & a & 1 \end{pmatrix},$$

where $a \in \mathbb{R}$.

(a) For what values of a is K the covariance matrix of some random vector? 

(b) Let a have one of the values found in part (a) . Fill in the missing entries of the matrix U, 

* * 
U 



to yield an orthonormal matrix, and find a diagonal matrix $\Lambda$ with nonnegative entries, so that if $Z$ is a three dimensional random vector with $\mathrm{Cov}(Z) = I$, then $U\Lambda^{1/2}Z$ has covariance matrix $K$. (Hint: It happens that the matrix $U$ can be selected independently of $a$. Also, $1 + 2a$ is an eigenvalue of $K$.)

3.30 Kalman filter for a rotating state

Consider the Kalman state and observation equations for the following matrices, where O = $2\pi/10$ (the matrices don't depend on time, so the subscript $k$ is omitted):

(a) Explain in words what successive iterates $F^nx$ are like, for a nonzero initial state $x$ (this is the same as the state equation, but with the random term $w_k$ left off).

(b) Write out the Kalman filter equations for this example, simplifying as much as possible (but 
no more than possible! The equations don't simplify all that much.) 

3.31 * Proof of the orthogonality principle 

Prove the seven statements lettered (a)-(g) in what follows. 

Let $X$ be a random variable and let $\mathcal{V}$ be a collection of random variables on the same probability space such that

(i) $E[Z^2] < +\infty$ for each $Z \in \mathcal{V}$

(ii) $\mathcal{V}$ is a linear class, i.e., if $Z, Z' \in \mathcal{V}$ then so is $aZ + bZ'$ for any real numbers $a$ and $b$.

(iii) $\mathcal{V}$ is closed in the sense that if $Z_n \in \mathcal{V}$ for each $n$ and $Z_n$ converges to a random variable $Z$ in the mean square sense, then $Z \in \mathcal{V}$.

The Orthogonality Principle is that there exists a unique element $Z^* \in \mathcal{V}$ so that $E[(X - Z^*)^2] \leq E[(X - Z)^2]$ for all $Z \in \mathcal{V}$. Furthermore, a random variable $W \in \mathcal{V}$ is equal to $Z^*$ if and only if $(X - W) \perp Z$ for all $Z \in \mathcal{V}$. ($(X - W) \perp Z$ means $E[(X - W)Z] = 0$.)

The remainder of this problem is aimed at a proof. Let $d = \inf\{E[(X - Z)^2] : Z \in \mathcal{V}\}$. By definition of infimum there exists a sequence $Z_n \in \mathcal{V}$ so that $E[(X - Z_n)^2] \to d$ as $n \to +\infty$.

(a) The sequence $Z_n$ is Cauchy in the mean square sense. (Hint: Use the "parallelogram law": $E[(U - V)^2] + E[(U + V)^2] = 2(E[U^2] + E[V^2])$.) Thus, by the Cauchy criteria, there is a random variable $Z^*$ such that $Z_n$ converges to $Z^*$ in the mean square sense.

(b) $Z^*$ satisfies the conditions advertised in the first sentence of the principle.

(c) The element $Z^*$ satisfying the condition in the first sentence of the principle is unique. (Consider two random variables that are equal to each other with probability one to be the same.) This completes the proof of the first sentence.

(d) ("if" part of second sentence). If $W \in \mathcal{V}$ and $(X - W) \perp Z$ for all $Z \in \mathcal{V}$, then $W = Z^*$.
(The "only if" part of second sentence is divided into three parts:)

(e) $E[(X - Z^* - cZ)^2] \geq E[(X - Z^*)^2]$ for any real constant $c$.

(f) $-2cE[(X - Z^*)Z] + c^2E[Z^2] \geq 0$ for any real constant $c$.

(g) $(X - Z^*) \perp Z$, and the principle is proved.

3.32 * The span of two closed subspaces is closed 

Check that the span, $\mathcal{V}_1 \oplus \mathcal{V}_2$, of two closed linear spaces (defined in Proposition 3.2.4) is also a closed linear space. A hint for showing that $\mathcal{V}$ is closed is to use the fact that if $(Z_n)$ is a m.s. convergent sequence of random variables in $\mathcal{V}$, then each variable in the sequence can be represented as $Z_n = Z_{n,1} + Z_{n,2}$, where $Z_{n,i} \in \mathcal{V}_i$, and $E[(Z_n - Z_m)^2] = E[(Z_{n,1} - Z_{m,1})^2] + E[(Z_{n,2} - Z_{m,2})^2]$.

3.33 * Von Neumann's alternating projections algorithm

Let $\mathcal{V}_1$ and $\mathcal{V}_2$ be closed linear subspaces of $L^2(\Omega, \mathcal{F}, P)$, and let $X \in L^2(\Omega, \mathcal{F}, P)$. Define a sequence $(Z_n : n \ge 0)$ recursively, by alternating projections onto $\mathcal{V}_1$ and $\mathcal{V}_2$, as follows. Let $Z_0 = X$, and for $k \ge 0$, let $Z_{2k+1} = \Pi_{\mathcal{V}_1}(Z_{2k})$ and $Z_{2k+2} = \Pi_{\mathcal{V}_2}(Z_{2k+1})$. The goal of this problem is to show that $Z_n \to \Pi_{\mathcal{V}_1 \cap \mathcal{V}_2}(X)$ in the m.s. sense. The approach will be to establish that $(Z_n)$ converges in the m.s. sense, by verifying the Cauchy criteria, and then use the orthogonality principle to identify the limit. Define $D(i,j) = E[(Z_i - Z_j)^2]$ for $i \ge 0$ and $j \ge 0$, and let $\epsilon_i = D(i+1, i)$ for $i \ge 0$.

(a) Show that $\epsilon_i = E[(Z_i)^2] - E[(Z_{i+1})^2]$.

(b) Show that $\sum_{i=0}^{\infty} \epsilon_i \le E[X^2] < \infty$.

(c) Use the orthogonality principle to show that for $n \ge 1$ and $k \ge 0$:

$$D(n, n+2k+1) = \epsilon_n + D(n+1, n+2k+1)$$
$$D(n, n+2k+2) = D(n, n+2k+1) - \epsilon_{n+2k+1}.$$

(d) Use the above equations to show that for $n \ge 1$ and $k \ge 0$,

$$D(n, n+2k+1) = \epsilon_n + \cdots + \epsilon_{n+k} - (\epsilon_{n+k+1} + \cdots + \epsilon_{n+2k})$$
$$D(n, n+2k+2) = \epsilon_n + \cdots + \epsilon_{n+k} - (\epsilon_{n+k+1} + \cdots + \epsilon_{n+2k+1}).$$

Consequently, $D(n,m) \le \sum_{i=n}^{m-1} \epsilon_i$ for $1 \le n < m$, and therefore $(Z_n : n \ge 0)$ is a Cauchy sequence, so $Z_n \to Z_\infty$ in the m.s. sense for some random variable $Z_\infty$.

(e) Verify that $Z_\infty \in \mathcal{V}_1 \cap \mathcal{V}_2$.

(f) Verify that $(X - Z_\infty) \perp Z$ for any $Z \in \mathcal{V}_1 \cap \mathcal{V}_2$. (Hint: Explain why $(X - Z_n) \perp Z$ for all $n$, and let $n \to \infty$.)

By the orthogonality principle, (e) and (f) imply that $Z_\infty = \Pi_{\mathcal{V}_1 \cap \mathcal{V}_2}(X)$.
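
The recursion in Problem 3.33 can be tried out numerically in a finite-dimensional analogue. The sketch below is only an illustration under stated assumptions: vectors in R^3 with the Euclidean inner product stand in for random variables with inner product E[XY], and the helper names `project` and `alternating_projections`, the two planes, and the starting vector are all hypothetical choices, not part of the problem.

```python
import math

def project(z, basis):
    """Orthogonal projection of z onto span(basis); basis is assumed orthonormal."""
    out = [0.0] * len(z)
    for b in basis:
        coeff = sum(zi * bi for zi, bi in zip(z, b))
        out = [oi + coeff * bi for oi, bi in zip(out, b)]
    return out

def alternating_projections(x, basis1, basis2, n_rounds):
    """Apply Z_{2k+1} = proj_V1(Z_{2k}) and Z_{2k+2} = proj_V2(Z_{2k+1}) for n_rounds rounds."""
    z = list(x)
    for _ in range(n_rounds):
        z = project(z, basis1)
        z = project(z, basis2)
    return z

r = 1 / math.sqrt(2)
V1 = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]  # the plane spanned by e1 and e2
V2 = [(1.0, 0.0, 0.0), (0.0, r, r)]      # a second plane; V1 and V2 meet in span{e1}
limit = alternating_projections((1.0, 2.0, 3.0), V1, V2, n_rounds=60)
```

In this example the intersection of the two planes is the first coordinate axis, so the iterates should approach the projection of the starting vector onto that axis.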
Chapter 4

Random Processes 

4.1 Definition of a random process 

A random process $X$ is an indexed collection $X = (X_t : t \in \mathbb{T})$ of random variables, all on the same probability space $(\Omega, \mathcal{F}, P)$. In many applications the index set $\mathbb{T}$ is a set of times. If $\mathbb{T} = \mathbb{Z}$, or more generally, if $\mathbb{T}$ is a set of consecutive integers, then $X$ is called a discrete-time random process. If $\mathbb{T} = \mathbb{R}$ or if $\mathbb{T}$ is an interval of $\mathbb{R}$, then $X$ is called a continuous-time random process. Three ways to view a random process $X = (X_t : t \in \mathbb{T})$ are as follows:

• For each $t$ fixed, $X_t$ is a function on $\Omega$.

• $X$ is a function on $\mathbb{T} \times \Omega$ with value $X_t(\omega)$ for given $t \in \mathbb{T}$ and $\omega \in \Omega$.

• For each $\omega$ fixed with $\omega \in \Omega$, $X_t(\omega)$ is a function of $t$, called the sample path corresponding to $\omega$.

Example 4.1.1 Suppose $W_1, W_2, \ldots$ are independent random variables with $P\{W_k = 1\} = P\{W_k = -1\} = \frac{1}{2}$ for each $k$, and suppose $X_0 = 0$ and $X_n = W_1 + \cdots + W_n$ for positive integers $n$. Let $W = (W_k : k \ge 1)$ and $X = (X_n : n \ge 0)$. Then $W$ and $X$ are both discrete-time random processes. The index set $\mathbb{T}$ for $X$ is $\mathbb{Z}_+$. A sample path of $W$ and a corresponding sample path of $X$ are shown in Figure 4.1.
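
A sample path of the pair (W, X) in Example 4.1.1 can be generated in a few lines. This is a minimal sketch: the function name `random_walk_path` and the use of a seeded pseudorandom generator to play the role of a fixed outcome ω are assumptions made for illustration.

```python
import random

def random_walk_path(n_steps, seed=None):
    """Return (W, X): W[k-1] is W_k in {-1, +1}, and X[n] = W_1 + ... + W_n with X[0] = 0."""
    rng = random.Random(seed)  # fixing the seed plays the role of fixing omega
    W = [rng.choice((-1, 1)) for _ in range(n_steps)]
    X = [0]
    for w in W:
        X.append(X[-1] + w)
    return W, X

W, X = random_walk_path(10, seed=1)
```

Each call with a different seed gives a different sample path; for a fixed seed, W and X are ordinary sequences of numbers.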

The following notation is used:

$$\mu_X(t) = E[X_t]$$
$$R_X(s,t) = E[X_s X_t]$$
$$C_X(s,t) = \mathrm{Cov}(X_s, X_t)$$
$$F_{X,n}(x_1, t_1; \ldots; x_n, t_n) = P\{X_{t_1} \le x_1, \ldots, X_{t_n} \le x_n\}$$

and $\mu_X$ is called the mean function, $R_X$ is called the correlation function, $C_X$ is called the covariance function, and $F_{X,n}$ is called the nth order CDF. Sometimes the prefix "auto," meaning "self," is added to the words correlation and covariance, to emphasize that only one random process is involved.

Figure 4.1: Typical sample paths.

Definition 4.1.2 A second order random process is a random process $(X_t : t \in \mathbb{T})$ such that $E[X_t^2] < +\infty$ for all $t \in \mathbb{T}$.

The mean, correlation, and covariance functions of a second order random process are all well-defined and finite.

If $X_t$ is a discrete random variable for each $t$, then the nth order pmf of $X$ is defined by

$$p_{X,n}(x_1, t_1; \ldots; x_n, t_n) = P\{X_{t_1} = x_1, \ldots, X_{t_n} = x_n\}.$$

Similarly, if $X_{t_1}, \ldots, X_{t_n}$ are jointly continuous random variables for any distinct $t_1, \ldots, t_n$ in $\mathbb{T}$, then $X$ has an nth order pdf $f_{X,n}$, such that for $t_1, \ldots, t_n$ fixed, $f_{X,n}(x_1, t_1; \ldots; x_n, t_n)$ is the joint pdf of $X_{t_1}, \ldots, X_{t_n}$.



Example 4.1.3 Let $A$ and $B$ be independent, $N(0,1)$ random variables. Suppose $X_t = A + Bt + t^2$ for all $t \in \mathbb{R}$. Let us describe the sample functions, the mean, correlation, and covariance functions, and the first and second order pdf's of $X$.

Each sample function corresponds to some fixed $\omega$ in $\Omega$. For $\omega$ fixed, $A(\omega)$ and $B(\omega)$ are numbers. The sample paths all have the same shape: they are parabolas with constant second derivative equal to 2. The sample path for $\omega$ fixed has $t = 0$ intercept $A(\omega)$, and minimum value $A(\omega) - \frac{B(\omega)^2}{4}$ achieved at $t = -\frac{B(\omega)}{2}$. Three typical sample paths are shown in Figure 4.2.

Figure 4.2: Typical sample paths.

The various moment functions are given by

$$\mu_X(t) = E[A + Bt + t^2] = t^2$$
$$R_X(s,t) = E[(A + Bs + s^2)(A + Bt + t^2)] = 1 + st + s^2 t^2$$
$$C_X(s,t) = R_X(s,t) - \mu_X(s)\mu_X(t) = 1 + st.$$

As for the densities, for each $t$ fixed, $X_t$ is a linear combination of two independent Gaussian random variables, and $X_t$ has mean $\mu_X(t) = t^2$ and variance $\mathrm{Var}(X_t) = C_X(t,t) = 1 + t^2$. Thus, $X_t$ is a $N(t^2, 1 + t^2)$ random variable. That specifies the first order pdf $f_{X,1}$ well enough, but if one insists on writing it out in all detail it is given by

$$f_{X,1}(x,t) = \frac{1}{\sqrt{2\pi(1+t^2)}} \exp\left( -\frac{(x - t^2)^2}{2(1+t^2)} \right).$$

For $s$ and $t$ fixed distinct numbers, $X_s$ and $X_t$ are jointly Gaussian and their covariance matrix is given by

$$\mathrm{Cov}\begin{pmatrix} X_s \\ X_t \end{pmatrix} = \begin{pmatrix} 1 + s^2 & 1 + st \\ 1 + st & 1 + t^2 \end{pmatrix}.$$

The determinant of this matrix is $(s - t)^2$, which is nonzero. Thus $X$ has a second order pdf $f_{X,2}$. For most purposes, we have already written enough about $f_{X,2}$ for this example, but in full detail it is given by

$$f_{X,2}(x,s;y,t) = \frac{1}{2\pi|s-t|} \exp\left( -\frac{1}{2} \begin{pmatrix} x - s^2 \\ y - t^2 \end{pmatrix}^T \begin{pmatrix} 1 + s^2 & 1 + st \\ 1 + st & 1 + t^2 \end{pmatrix}^{-1} \begin{pmatrix} x - s^2 \\ y - t^2 \end{pmatrix} \right).$$

The nth order distributions of $X$ for this example are joint Gaussian distributions, but densities don't exist for $n \ge 3$ because $X_{t_1}, X_{t_2}$, and $X_{t_3}$ are linearly dependent for any $t_1, t_2, t_3$.
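
The two determinant facts used in Example 4.1.3 are easy to check numerically. The following sketch is illustrative only; the function names are hypothetical, and the covariance entries are taken from the formula C_X(s,t) = 1 + st derived in the example.

```python
def cov_fn(s, t):
    """Covariance function C_X(s, t) = 1 + s t of X_t = A + B t + t^2."""
    return 1 + s * t

def det2(s, t):
    """Determinant of the covariance matrix of (X_s, X_t)."""
    return cov_fn(s, s) * cov_fn(t, t) - cov_fn(s, t) ** 2

def det3(t1, t2, t3):
    """Determinant of the 3x3 covariance matrix of (X_t1, X_t2, X_t3), by cofactor expansion."""
    ts = (t1, t2, t3)
    m = [[cov_fn(a, b) for b in ts] for a in ts]
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
```

For distinct s and t, `det2(s, t)` agrees with (s - t)^2, while `det3` vanishes for every choice of three times, which is the numerical counterpart of the statement that no third order density exists.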



A random process $(X_t : t \in \mathbb{T})$ is said to be Gaussian if the random variables $X_t : t \in \mathbb{T}$ comprising the process are jointly Gaussian. The process $X$ in the example just discussed is Gaussian. All the finite order distributions of a Gaussian random process $X$ are determined by the mean function $\mu_X$ and autocorrelation function $R_X$. Indeed, for any finite subset $\{t_1, t_2, \ldots, t_n\}$ of $\mathbb{T}$, $(X_{t_1}, \ldots, X_{t_n})^T$ is a Gaussian vector with mean $(\mu_X(t_1), \ldots, \mu_X(t_n))^T$ and covariance matrix with ijth element $C_X(t_i, t_j) = R_X(t_i, t_j) - \mu_X(t_i)\mu_X(t_j)$. Two or more random processes are said to be jointly Gaussian if all the random variables comprising the processes are jointly Gaussian.

Example 4.1.4 Let $U = (U_k : k \in \mathbb{Z})$ be a random process such that the random variables $U_k : k \in \mathbb{Z}$ are independent, and $P[U_k = 1] = P[U_k = -1] = \frac{1}{2}$ for all $k$. Let $X = (X_t : t \in \mathbb{R})$ be the random process obtained by letting $X_t = U_n$ for $n \le t < n + 1$ for any $n$. Equivalently, $X_t = U_{\lfloor t \rfloor}$. A sample path of $U$ and a corresponding sample path of $X$ are shown in Figure 4.3.

Figure 4.3: Typical sample paths.

Both random processes have zero mean, so their covariance functions are equal to their correlation functions and are given by

$$R_U(k,l) = \begin{cases} 1 & \text{if } k = l \\ 0 & \text{else} \end{cases} \qquad R_X(s,t) = \begin{cases} 1 & \text{if } \lfloor s \rfloor = \lfloor t \rfloor \\ 0 & \text{else.} \end{cases}$$

The random variables of $U$ are discrete, so the nth order pmf of $U$ exists for all $n$. It is given by

$$p_{U,n}(x_1, k_1; \ldots; x_n, k_n) = \begin{cases} 2^{-n} & \text{if } (x_1, \ldots, x_n) \in \{-1, 1\}^n \\ 0 & \text{else} \end{cases}$$

for distinct integers $k_1, \ldots, k_n$. The nth order pmf of $X$ exists for the same reason, but it is a bit more difficult to write down. In particular, the joint pmf of $X_s$ and $X_t$ depends on whether $\lfloor s \rfloor = \lfloor t \rfloor$. If $\lfloor s \rfloor = \lfloor t \rfloor$ then $X_s = X_t$ and if $\lfloor s \rfloor \ne \lfloor t \rfloor$ then $X_s$ and $X_t$ are independent. Therefore, the second order pmf of $X$ is given as follows:

$$p_{X,2}(x_1, t_1; x_2, t_2) = \begin{cases} \frac{1}{2} & \text{if } \lfloor t_1 \rfloor = \lfloor t_2 \rfloor \text{ and either } x_1 = x_2 = 1 \text{ or } x_1 = x_2 = -1 \\ \frac{1}{4} & \text{if } \lfloor t_1 \rfloor \ne \lfloor t_2 \rfloor \text{ and } x_1, x_2 \in \{-1, 1\} \\ 0 & \text{else.} \end{cases}$$

4.2 Random walks and gambler's ruin 

Suppose $p$ is given with $0 < p < 1$. Let $W_1, W_2, \ldots$ be independent random variables with $P\{W_i = 1\} = p$ and $P\{W_i = -1\} = 1 - p$ for $i \ge 1$. Suppose $X_0$ is an integer valued random variable independent of $(W_1, W_2, \ldots)$, and for $n \ge 1$, define $X_n$ by $X_n = X_0 + W_1 + \cdots + W_n$. A sample path of $X = (X_n : n \ge 0)$ is shown in Figure 4.4. The random process $X$ is called a random walk. Write $P_k$ and $E_k$ for conditional probabilities and conditional expectations given that $X_0 = k$. For example, $P_k[A] = P[A \mid X_0 = k]$ for any event $A$. Let us summarize some of the basic properties of $X$.

Figure 4.4: A typical sample path.

• $E_k[X_n] = k + n(2p - 1)$.

• $\mathrm{Var}_k(X_n) = \mathrm{Var}(k + W_1 + \cdots + W_n) = 4np(1 - p)$.

• $\lim_{n \to \infty} \frac{X_n}{n} = 2p - 1$ (a.s. and m.s. under $P_k$, $k$ fixed).

• $\lim_{n \to \infty} P_k\left\{ \frac{X_n - n(2p-1)}{\sqrt{4np(1-p)}} \le c \right\} = \Phi(c)$.

• $P_k\{X_n = k + j - (n - j)\} = \binom{n}{j} p^j (1 - p)^{n-j}$ for $0 \le j \le n$.

Almost all the properties listed are properties of the one dimensional distributions of $X$. In fact, only the strong law of large numbers, giving the a.s. convergence in the third property listed, depends on the joint distribution of the $X_n$'s.

The so-called Gambler's Ruin problem is a nice example of the calculation of a probability involving the joint distributions of the random walk $X$. Interpret $X_n$ as the number of units of money a gambler has at time $n$. Assume that the initial wealth $k$ satisfies $k \ge 0$, and suppose the gambler has a goal of accumulating $b$ units of money for some positive integer $b \ge k$. While the random walk $(X_n : n \ge 0)$ continues on forever, we are only interested in it until it hits either 0 or $b$. Let $R_b$ denote the event that the gambler is eventually ruined, meaning the random walk reaches zero without first reaching $b$. The gambler's ruin probability is $P_k[R_b]$. A simple idea allows us to compute the ruin probability. The idea is to condition on the value of the first step $W_1$, and then to recognize that after the first step is taken, the conditional probability of ruin is the same as the unconditional probability of ruin for initial wealth $k + W_1$.

Let $r_k = P_k[R_b]$ for $0 \le k \le b$, so $r_k$ is the ruin probability for the gambler with initial wealth $k$ and target wealth $b$. Clearly $r_0 = 1$ and $r_b = 0$. For $1 \le k \le b - 1$, condition on $W_1$ to yield

$$r_k = P_k\{W_1 = 1\} P_k[R_b \mid W_1 = 1] + P_k\{W_1 = -1\} P_k[R_b \mid W_1 = -1]$$

or $r_k = p r_{k+1} + (1 - p) r_{k-1}$. This yields $b - 1$ linear equations for the $b - 1$ unknowns $r_1, \ldots, r_{b-1}$.

If $p = \frac{1}{2}$ the equations become $r_k = \frac{1}{2}\{r_{k-1} + r_{k+1}\}$ so that $r_k = A + Bk$ for some constants $A$ and $B$. Using the boundary conditions $r_0 = 1$ and $r_b = 0$, we find that $r_k = 1 - \frac{k}{b}$ in case $p = \frac{1}{2}$. Note that, interestingly enough, after the gambler stops playing, he'll have $b$ units with probability $\frac{k}{b}$ and zero units otherwise. Thus, his expected wealth after completing the game is equal to his initial capital, $k$.

If $p \ne \frac{1}{2}$, we seek a solution of the form $r_k = A\theta_1^k + B\theta_2^k$, where $\theta_1$ and $\theta_2$ are the two roots of the quadratic equation $\theta = p\theta^2 + (1 - p)$ and $A$, $B$ are selected to meet the two boundary conditions. The roots are 1 and $\frac{1-p}{p}$, and finding $A$ and $B$ yields that, if $p \ne \frac{1}{2}$,

$$r_k = \frac{\left(\frac{1-p}{p}\right)^k - \left(\frac{1-p}{p}\right)^b}{1 - \left(\frac{1-p}{p}\right)^b}, \qquad 0 \le k \le b.$$

Focus, now, on the case that $p > \frac{1}{2}$. By the law of large numbers, $\frac{X_n}{n} \to 2p - 1$ a.s. as $n \to \infty$. This implies, in particular, that $X_n \to +\infty$ a.s. as $n \to \infty$. Thus, unless the gambler is ruined in finite time, his capital converges to infinity. Let $R$ be the event that the gambler is eventually ruined. The events $R_b$ increase with $b$ because if $b$ is larger the gambler has more possibilities to be ruined before accumulating $b$ units of money: $R_b \subset R_{b+1} \subset \cdots$ and $R = \cup_{b=k}^{\infty} R_b$. Therefore by the countable additivity of probability,

$$P_k[R] = \lim_{b \to \infty} P_k[R_b] = \lim_{b \to \infty} r_k = \left( \frac{1-p}{p} \right)^k.$$

Thus, the probability of eventual ruin decreases geometrically with the initial wealth $k$.
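
The closed form for r_k can be compared with a direct solution of the b - 1 linear equations. The sketch below is one possible way to do that, under the assumption that b is small enough for floating point error to be negligible: it writes r_k = a_k + c_k r_1, propagates both sequences through r_{k+1} = (r_k - (1-p) r_{k-1})/p, and then picks r_1 so that r_b = 0. The function names are hypothetical.

```python
def ruin_closed_form(k, b, p):
    """Ruin probability r_k from the formulas in the text (p = 1/2 handled separately)."""
    if p == 0.5:
        return 1 - k / b
    theta = (1 - p) / p
    return (theta**k - theta**b) / (1 - theta**b)

def ruin_by_recursion(b, p):
    """Solve r_k = p r_{k+1} + (1-p) r_{k-1}, r_0 = 1, r_b = 0, writing r_k = a_k + c_k * r_1."""
    a = [1.0, 0.0]  # part of r_k that does not depend on r_1
    c = [0.0, 1.0]  # coefficient of the unknown r_1
    for k in range(1, b):
        a.append((a[k] - (1 - p) * a[k - 1]) / p)
        c.append((c[k] - (1 - p) * c[k - 1]) / p)
    r1 = -a[b] / c[b]  # boundary condition r_b = 0
    return [a[k] + c[k] * r1 for k in range(b + 1)]

r = ruin_by_recursion(b=10, p=0.6)
```

For p > 1/2 and large b, `ruin_closed_form(k, b, p)` is close to ((1-p)/p)^k, in line with the limit computed above.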

4.3 Processes with independent increments and martingales 

The increment of a random process $X = (X_t : t \in \mathbb{T})$ over an interval $[a, b]$ is the random variable $X_b - X_a$. A random process is said to have independent increments if for any positive integer $n$ and any $t_0 < t_1 < \cdots < t_n$ in $\mathbb{T}$, the increments $X_{t_1} - X_{t_0}, \ldots, X_{t_n} - X_{t_{n-1}}$ are mutually independent.

A random process $(X_t : t \in \mathbb{T})$ is called a martingale if $E[X_t]$ is finite for all $t$ and for any positive integer $n$ and $t_1 < t_2 < \cdots < t_n < t_{n+1}$,

$$E[X_{t_{n+1}} \mid X_{t_1}, \ldots, X_{t_n}] = X_{t_n}$$

or, equivalently,

$$E[X_{t_{n+1}} - X_{t_n} \mid X_{t_1}, \ldots, X_{t_n}] = 0.$$

If $t_n$ is interpreted as the present time, then $t_{n+1}$ is a future time and the value of $(X_{t_1}, \ldots, X_{t_n})$ represents information about the past and present values of $X$. With this interpretation, the martingale property is that the future increments of $X$ have conditional mean zero, given the past and present values of the process.

An example of a martingale is the following. Suppose a gambler has initial wealth $X_0$. Suppose the gambler makes bets with various odds, such that, as far as the past history of $X$ can determine, the bets made are all for fair games in which the expected net gains are zero. Then if $X_t$ denotes the wealth of the gambler at any time $t \ge 0$, then $(X_t : t \ge 0)$ is a martingale.

Suppose $(X_t)$ is an independent increment process with index set $\mathbb{T} = \mathbb{R}_+$ or $\mathbb{T} = \mathbb{Z}_+$, with $X_0$ equal to a constant and with mean zero increments. Then $X$ is a martingale, as we now show. Let $t_1 < \cdots < t_{n+1}$ be in $\mathbb{T}$. Then $(X_{t_1}, \ldots, X_{t_n})$ is a function of the increments $X_{t_1} - X_0, X_{t_2} - X_{t_1}, \ldots, X_{t_n} - X_{t_{n-1}}$, and hence it is independent of the increment $X_{t_{n+1}} - X_{t_n}$. Thus

$$E[X_{t_{n+1}} - X_{t_n} \mid X_{t_1}, \ldots, X_{t_n}] = E[X_{t_{n+1}} - X_{t_n}] = 0.$$

The random walk $(X_n : n \ge 0)$ arising in the gambler's ruin problem is an independent increment process, and if $p = \frac{1}{2}$ it is also a martingale.

The following proposition is stated, without proof, to give an indication of some of the useful 
deductions that follow from the martingale property. 

Proposition 4.3.1 (a) Let $X_0, X_1, X_2, \ldots$ be nonnegative random variables such that $E[X_{k+1} \mid X_0, \ldots, X_k] \le X_k$ for $k \ge 0$ (such $X$ is a nonnegative supermartingale). Then

$$P\left\{ \left( \max_{0 \le k \le n} X_k \right) \ge \gamma \right\} \le \frac{E[X_0]}{\gamma}.$$

(b) (Doob's $L^2$ Inequality) Let $X_0, X_1, \ldots$ be a martingale sequence with $E[X_n^2] < +\infty$ for some $n$. Then

$$E\left[ \left( \max_{0 \le k \le n} X_k \right)^2 \right] \le 4 E[X_n^2].$$



4.4 Brownian motion 

A Brownian motion, also called a Wiener process, is a random process $W = (W_t : t \ge 0)$ such that

B.0 $P\{W_0 = 0\} = 1$.

B.1 $W$ has independent increments.

B.2 $W_t - W_s$ has the $N(0, \sigma^2(t - s))$ distribution for $t \ge s$.

B.3 $P[W_t \text{ is a continuous function of } t] = 1$, or in other words, $W$ is sample path continuous with probability one.

Figure 4.5: A typical sample path of Brownian motion.

A typical sample path of a Brownian motion is shown in Figure 4.5. A Brownian motion, being a mean zero independent increment process with $P\{W_0 = 0\} = 1$, is a martingale.

The mean, correlation, and covariance functions of a Brownian motion $W$ are given by

$$\mu_W(t) = E[W_t] = E[W_t - W_0] = 0$$

and, for $s \le t$,

$$R_W(s,t) = E[W_s W_t] = E[(W_s - W_0)(W_s - W_0 + W_t - W_s)] = E[(W_s - W_0)^2] = \sigma^2 s$$

so that, in general,

$$C_W(s,t) = R_W(s,t) = \sigma^2 (s \wedge t).$$

A Brownian motion is Gaussian, because if $0 = t_0 \le t_1 \le \cdots \le t_n$, then each coordinate of the vector $(W_{t_1}, \ldots, W_{t_n})$ is a linear combination of the $n$ independent Gaussian random variables $(W_{t_i} - W_{t_{i-1}} : 1 \le i \le n)$. Thus, properties B.0-B.2 imply that $W$ is a Gaussian random process with $\mu_W = 0$ and $R_W(s,t) = \sigma^2(s \wedge t)$. In fact, the converse is also true. If $W = (W_t : t \ge 0)$ is a Gaussian random process with mean zero and $R_W(s,t) = \sigma^2(s \wedge t)$, then B.0-B.2 are true.
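
Properties B.0-B.2 translate directly into a way to sample a Brownian motion at finitely many times, and the correlation formula can then be checked by Monte Carlo. The sketch below is illustrative only: the function name `brownian_path`, the seed, the times s and t, and the number of runs are arbitrary assumptions, and the estimate is random, so it is only expected to be near σ²(s ∧ t).

```python
import math
import random

def brownian_path(times, sigma, rng):
    """Sample (W_t : t in times) for nondecreasing times >= 0, using W_0 = 0 (B.0)
    and independent N(0, sigma^2 * dt) increments (B.1, B.2)."""
    path, w, prev = [], 0.0, 0.0
    for t in times:
        w += rng.gauss(0.0, sigma * math.sqrt(t - prev))
        path.append(w)
        prev = t
    return path

rng = random.Random(0)
s, t, sigma, runs = 1.0, 2.0, 1.0, 20000
total = 0.0
for _ in range(runs):
    w_s, w_t = brownian_path([s, t], sigma, rng)
    total += w_s * w_t
corr_estimate = total / runs  # Monte Carlo estimate of R_W(s, t)
```

Sampling on a finite grid says nothing about B.3; sample path continuity concerns uncountably many times at once and cannot be observed from finitely many samples.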

Property B.3 does not come automatically. For example, if $W$ is a Brownian motion and if $U$ is a Unif(0,1) distributed random variable independent of $W$, let $\widetilde{W}$ be defined by

$$\widetilde{W}_t = W_t + I_{\{U = t\}}.$$

Then $P\{\widetilde{W}_t = W_t\} = 1$ for each $t \ge 0$ and $\widetilde{W}$ also satisfies B.0-B.2, but $\widetilde{W}$ fails to satisfy B.3. Thus, $\widetilde{W}$ is not a Brownian motion. The difference between $W$ and $\widetilde{W}$ is significant if events involving uncountably many values of $t$ are investigated. For example,

$$P\{W_t \le 1 \text{ for } 0 \le t \le 1\} \ne P\{\widetilde{W}_t \le 1 \text{ for } 0 \le t \le 1\}.$$

4.5 Counting processes and the Poisson process

A function $f$ on $\mathbb{R}_+$ is called a counting function if $f(0) = 0$, $f$ is nondecreasing, $f$ is right continuous, and $f$ is integer valued. The interpretation is that $f(t)$ is the number of "counts" observed during the interval $(0, t]$. An increment $f(b) - f(a)$ is the number of counts in the interval $(a, b]$. If $t_i$ denotes the time of the ith count for $i \ge 1$, then $f$ can be described by the sequence $(t_i)$. Or, if $u_1 = t_1$ and $u_i = t_i - t_{i-1}$ for $i \ge 2$, then $f$ can be described by the sequence $(u_i)$. See Figure 4.6. The numbers $t_1, t_2, \ldots$ are called the count times and the numbers $u_1, u_2, \ldots$ are called the intercount times.

Figure 4.6: A counting function.

The following equations clearly hold:

$$f(t) = \sum_{n=1}^{\infty} I_{\{t \ge t_n\}}$$
$$t_n = \min\{t : f(t) \ge n\}$$
$$t_n = u_1 + \cdots + u_n.$$
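
The three equations relating f, the count times and the intercount times can be mirrored in code. The sketch below assumes a finite list of strictly positive intercount times, so that the count times are strictly increasing; the function names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

def count_times_from_intercounts(u):
    """t_n = u_1 + ... + u_n."""
    return list(accumulate(u))

def counting_function(count_times, t):
    """f(t) = number of n with t_n <= t; count_times must be sorted."""
    return bisect_right(count_times, t)

u = [0.5, 1.0, 0.25]
t_list = count_times_from_intercounts(u)  # count times 0.5, 1.5, 1.75
```

Using `bisect_right` makes the function right continuous: a count that occurs exactly at time t is included in f(t), matching the interval (0, t] in the text.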

A random process is called a counting process if with probability one its sample path is a 
counting function. A counting process has two corresponding random sequences, the sequence of 
count times and the sequence of intercount times. 

The most widely used example of a counting process is a Poisson process, defined next. 

Definition 4.5.1 Let $\lambda > 0$. By definition, a Poisson process with rate $\lambda$ is a random process $N = (N_t : t \ge 0)$ such that

N.1 $N$ is a counting process,

N.2 $N$ has independent increments,

N.3 $N(t) - N(s)$ has the $Poi(\lambda(t - s))$ distribution for $t \ge s$.

Proposition 4.5.2 Let $N$ be a counting process and let $\lambda > 0$. The following are equivalent:

(a) $N$ is a Poisson process with rate $\lambda$.

(b) The intercount times $U_1, U_2, \ldots$ are mutually independent, $Exp(\lambda)$ random variables.

(c) For each $\tau > 0$, $N_\tau$ is a Poisson random variable with parameter $\lambda\tau$, and given $\{N_\tau = n\}$, the times of the $n$ counts during $[0, \tau]$ are the same as $n$ independent, Unif$[0, \tau]$ random variables, reordered to be nondecreasing. That is, for any $n \ge 1$, the conditional density of the first $n$ count times, $(T_1, \ldots, T_n)$, given the event $\{N_\tau = n\}$, is:

$$f(t_1, \ldots, t_n \mid N_\tau = n) = \begin{cases} \frac{n!}{\tau^n} & \text{if } 0 < t_1 < \cdots < t_n \le \tau \\ 0 & \text{else} \end{cases} \qquad (4.1)$$



Proof. It will be shown that (a) implies (b), (b) implies (c), and (c) implies (a). 

(a) implies (b). Suppose $N$ is a Poisson process. The joint pdf of the first $n$ count times $T_1, \ldots, T_n$ can be found as follows. Let $0 < t_1 < t_2 < \cdots < t_n$. Select $\epsilon > 0$ so small that $(t_1 - \epsilon, t_1], (t_2 - \epsilon, t_2], \ldots, (t_n - \epsilon, t_n]$ are disjoint intervals of $\mathbb{R}_+$. Then the probability that $(T_1, \ldots, T_n)$ is in the n-dimensional cube with upper corner $t_1, \ldots, t_n$ and sides of length $\epsilon$ is given by

$$P\{T_i \in (t_i - \epsilon, t_i] \text{ for } 1 \le i \le n\}$$
$$= P\{N_{t_1 - \epsilon} = 0, N_{t_1} - N_{t_1 - \epsilon} = 1, N_{t_2 - \epsilon} - N_{t_1} = 0, \ldots, N_{t_n} - N_{t_n - \epsilon} = 1\}$$
$$= (e^{-\lambda(t_1 - \epsilon)})(\lambda\epsilon e^{-\lambda\epsilon})(e^{-\lambda(t_2 - \epsilon - t_1)}) \cdots (\lambda\epsilon e^{-\lambda\epsilon})$$
$$= (\lambda\epsilon)^n e^{-\lambda t_n}.$$

The volume of the cube is $\epsilon^n$. Therefore $(T_1, \ldots, T_n)$ has the pdf

$$f_{T_1 \cdots T_n}(t_1, \ldots, t_n) = \begin{cases} \lambda^n e^{-\lambda t_n} & \text{if } 0 < t_1 < \cdots < t_n \\ 0 & \text{else.} \end{cases} \qquad (4.2)$$

The vector $(U_1, \ldots, U_n)$ is the image of $(T_1, \ldots, T_n)$ under the mapping $(t_1, \ldots, t_n) \to (u_1, \ldots, u_n)$ defined by $u_1 = t_1$, $u_k = t_k - t_{k-1}$ for $k \ge 2$. The mapping is invertible, because $t_k = u_1 + \cdots + u_k$ for $1 \le k \le n$, it has range $\mathbb{R}_+^n$, and the Jacobian

$$\frac{\partial u}{\partial t} = \begin{pmatrix} 1 & & & \\ -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1 \end{pmatrix}$$

has unit determinant. Therefore, by the formula for the transformation of random vectors (see Section 1.11),

$$f_{U_1 \cdots U_n}(u_1, \ldots, u_n) = \begin{cases} \lambda^n e^{-\lambda(u_1 + \cdots + u_n)} & u \in \mathbb{R}_+^n \\ 0 & \text{else.} \end{cases} \qquad (4.3)$$

The joint pdf in (4.3) factors into the product of $n$ pdfs, with each pdf being for an $Exp(\lambda)$ random variable. Thus the intercount times $U_1, U_2, \ldots$ are independent and each is exponentially distributed with parameter $\lambda$. So (a) implies (b).
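
Characterization (b) suggests a simple way to simulate a Poisson process: accumulate independent exponential intercount times. The sketch below is a hedged illustration; the function name, seed, rate, horizon and number of runs are arbitrary assumptions, and the average count is a random estimate that should merely be near λτ.

```python
import random

def poisson_count_times(lam, horizon, rng):
    """Count times in (0, horizon], built from i.i.d. Exp(lam) intercount times."""
    times = []
    t = rng.expovariate(lam)
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(lam)
    return times

rng = random.Random(0)
lam, tau, runs = 2.0, 3.0, 5000
avg_count = sum(len(poisson_count_times(lam, tau, rng)) for _ in range(runs)) / runs
```

The number of count times returned by one call is a sample of N_τ, so averaging over many runs estimates E[N_τ] = λτ.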
(b) implies (c). Suppose that $N$ is a counting process such that the intercount times $U_1, U_2, \ldots$ are independent, $Exp(\lambda)$ random variables, for some $\lambda > 0$. Thus, for $n \ge 1$, the first $n$ intercount times have the joint pdf given in (4.3). Equivalently, appealing to the transformation of random vectors in the reverse direction, the pdf of the first $n$ count times, $(T_1, \ldots, T_n)$, is given by (4.2). Fix $\tau > 0$ and an integer $n \ge 1$. The event $\{N_\tau = n\}$ is equivalent to the event $(T_1, \ldots, T_{n+1}) \in A_{n,\tau}$, where

$$A_{n,\tau} = \{t \in \mathbb{R}_+^{n+1} : 0 < t_1 < \cdots < t_n \le \tau < t_{n+1}\}.$$

The conditional pdf of $(T_1, \ldots, T_{n+1})$, given that $\{N_\tau = n\}$, is obtained by starting with the joint pdf of $(T_1, \ldots, T_{n+1})$, namely $\lambda^{n+1} e^{-\lambda t_{n+1}}$ on $\{t \in \mathbb{R}^{n+1} : 0 < t_1 < \cdots < t_{n+1}\}$, setting it equal to zero off of the set $A_{n,\tau}$, and scaling it up by the factor $1/P\{N_\tau = n\}$ on $A_{n,\tau}$:

$$f(t_1, \ldots, t_{n+1} \mid N_\tau = n) = \begin{cases} \frac{\lambda^{n+1} e^{-\lambda t_{n+1}}}{P\{N_\tau = n\}} & 0 < t_1 < \cdots < t_n \le \tau < t_{n+1} \\ 0 & \text{else.} \end{cases} \qquad (4.4)$$

The joint density of $(T_1, \ldots, T_n)$, given that $\{N_\tau = n\}$, is obtained for each $(t_1, \ldots, t_n)$ by integrating the density in (4.4) with respect to $t_{n+1}$ over $\mathbb{R}$. If $0 < t_1 < \cdots < t_n \le \tau$ does not hold, the density in (4.4) is zero for all values of $t_{n+1}$. If $0 < t_1 < \cdots < t_n \le \tau$, then the density in (4.4) is nonzero for $t_{n+1} \in (\tau, \infty)$. Integrating (4.4) with respect to $t_{n+1}$ over $(\tau, \infty)$ yields:

$$f(t_1, \ldots, t_n \mid N_\tau = n) = \begin{cases} \frac{\lambda^n e^{-\lambda\tau}}{P\{N_\tau = n\}} & 0 < t_1 < \cdots < t_n \le \tau \\ 0 & \text{else.} \end{cases} \qquad (4.5)$$

The conditional density in (4.5) is constant over the set $\{t \in \mathbb{R}_+^n : 0 < t_1 < \cdots < t_n \le \tau\}$. Since the density must integrate to one, that constant must be the reciprocal of the n-dimensional volume of the set. The cube $[0, \tau]^n$ in $\mathbb{R}^n$ has volume $\tau^n$. It can be partitioned into $n!$ equal volume subsets corresponding to the $n!$ possible orderings of the numbers $t_1, \ldots, t_n$. Therefore, the set $\{t \in \mathbb{R}_+^n : 0 \le t_1 < \cdots < t_n \le \tau\}$, corresponding to one of the orderings, has volume $\tau^n / n!$. Hence, (4.5) implies both that (4.1) holds and that $P\{N_\tau = n\} = \frac{(\lambda\tau)^n e^{-\lambda\tau}}{n!}$. These implications are for $n \ge 1$. Also, $P\{N_\tau = 0\} = P\{U_1 > \tau\} = e^{-\lambda\tau}$. Thus, $N_\tau$ is a $Poi(\lambda\tau)$ random variable.

(c) implies (a). Suppose $0 = t_0 < t_1 < \cdots < t_k$ and let $n_1, \ldots, n_k$ be nonnegative integers. Set $n = n_1 + \cdots + n_k$ and $p_i = (t_i - t_{i-1})/t_k$ for $1 \le i \le k$. Suppose (c) is true. Given there are $n$ counts in the interval $[0, t_k]$, by (c) applied with $\tau = t_k$, the distribution of the numbers of counts in each subinterval is as if each of the $n$ counts is thrown into a subinterval at random, falling into the ith subinterval with probability $p_i$. The probability that, for $1 \le i \le k$, $n_i$ particular counts fall into the ith interval, is $p_1^{n_1} \cdots p_k^{n_k}$. The number of ways to assign $n$ counts to the intervals such that there are $n_i$ counts in the ith interval is $\binom{n}{n_1 \cdots n_k} = \frac{n!}{n_1! \cdots n_k!}$. This thus gives rise to what is known as a multinomial distribution for the numbers of counts per interval. We have

$$P\{N(t_i) - N(t_{i-1}) = n_i \text{ for } 1 \le i \le k\}$$
$$= P\{N(t_k) = n\} \, P[N(t_i) - N(t_{i-1}) = n_i \text{ for } 1 \le i \le k \mid N(t_k) = n]$$
$$= \frac{(\lambda t_k)^n e^{-\lambda t_k}}{n!} \binom{n}{n_1 \cdots n_k} p_1^{n_1} \cdots p_k^{n_k}$$
$$= \prod_{i=1}^{k} \frac{(\lambda(t_i - t_{i-1}))^{n_i} e^{-\lambda(t_i - t_{i-1})}}{n_i!}.$$

Therefore the increments $N(t_i) - N(t_{i-1})$, $1 \le i \le k$, are independent, with $N(t_i) - N(t_{i-1})$ being a Poisson random variable with mean $\lambda(t_i - t_{i-1})$, for $1 \le i \le k$. So (a) is proved. □
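
The final equality in the proof that (c) implies (a), a Poisson probability times a multinomial probability equals a product of Poisson probabilities, can be confirmed numerically. The sketch below assumes t_0 = 0 and evaluates both sides for one arbitrary choice of times, counts and rate; all names and numbers are illustrative assumptions.

```python
import math

def poisson_pmf(n, mean):
    """P{Poi(mean) = n}."""
    return mean**n * math.exp(-mean) / math.factorial(n)

def joint_via_conditioning(times, counts, lam):
    """P{N(t_k) = n} times the multinomial probability, with times = [t_1..t_k] and t_0 = 0."""
    t_k, n = times[-1], sum(counts)
    multinomial = float(math.factorial(n))
    prev = 0.0
    for t, n_i in zip(times, counts):
        p_i = (t - prev) / t_k  # probability a given count lands in (t_{i-1}, t_i]
        multinomial *= p_i**n_i / math.factorial(n_i)
        prev = t
    return poisson_pmf(n, lam * t_k) * multinomial

def joint_via_independence(times, counts, lam):
    """Product of Poisson pmfs with means lam * (t_i - t_{i-1})."""
    prob, prev = 1.0, 0.0
    for t, n_i in zip(times, counts):
        prob *= poisson_pmf(n_i, lam * (t - prev))
        prev = t
    return prob

lhs = joint_via_conditioning([0.5, 1.2, 3.0], [1, 0, 4], lam=2.0)
rhs = joint_via_independence([0.5, 1.2, 3.0], [1, 0, 4], lam=2.0)
```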

A Poisson process is not a martingale. However, if $\widetilde{N}$ is defined by $\widetilde{N}_t = N_t - \lambda t$, then $\widetilde{N}$ is an independent increment process with mean 0 and $\widetilde{N}_0 = 0$. Thus, $\widetilde{N}$ is a martingale. Note that $\widetilde{N}$ has the same mean and covariance function as a Brownian motion with $\sigma^2 = \lambda$, which shows how little one really knows about a process from its mean function and correlation function alone.

4.6 Stationarity 

Consider a random process $X = (X_t : t \in \mathbb{T})$ such that either $\mathbb{T} = \mathbb{Z}$ or $\mathbb{T} = \mathbb{R}$. Then $X$ is said to be stationary if for any $t_1, \ldots, t_n$ and $s$ in $\mathbb{T}$, the random vectors $(X_{t_1}, \ldots, X_{t_n})$ and $(X_{t_1+s}, \ldots, X_{t_n+s})$ have the same distribution. In other words, the joint statistics of $X$ of all orders are unaffected by a shift in time. The condition of stationarity of $X$ can also be expressed in terms of the CDF's of $X$: $X$ is stationary if for any $n \ge 1$, $s, t_1, \ldots, t_n \in \mathbb{T}$, and $x_1, \ldots, x_n \in \mathbb{R}$,

$$F_{X,n}(x_1, t_1; \ldots; x_n, t_n) = F_{X,n}(x_1, t_1 + s; \ldots; x_n, t_n + s).$$

Suppose $X$ is a stationary second order random process. (Recall that second order means that $E[X_t^2] < \infty$ for all $t$.) Then by the $n = 1$ part of the definition of stationarity, $X_t$ has the same distribution for all $t$. In particular, $\mu_X(t)$ and $E[X_t^2]$ do not depend on $t$. Moreover, by the $n = 2$ part of the definition $E[X_{t_1} X_{t_2}] = E[X_{t_1+s} X_{t_2+s}]$ for any $s \in \mathbb{T}$. If $E[X_t^2] < +\infty$ for all $t$, then $E[X_{t+s}]$ and $R_X(t_1 + s, t_2 + s)$ are finite and both do not depend on $s$.

A second order random process $(X_t : t \in \mathbb{T})$ with $\mathbb{T} = \mathbb{Z}$ or $\mathbb{T} = \mathbb{R}$ is called wide sense stationary (WSS) if

$$\mu_X(t) = \mu_X(s + t) \quad \text{and} \quad R_X(t_1, t_2) = R_X(t_1 + s, t_2 + s)$$

for all $t, s, t_1, t_2 \in \mathbb{T}$. As shown above, a stationary second order random process is WSS. Wide sense stationarity means that $\mu_X(t)$ is a finite number, not depending on $t$, and $R_X(t_1, t_2)$ depends on $t_1, t_2$ only through the difference $t_1 - t_2$. By a convenient and widely accepted abuse of notation, if $X$ is WSS, we use $\mu_X$ to be the constant and $R_X$ to be the function of one real variable such that

$$E[X_t] = \mu_X \qquad t \in \mathbb{T}$$
$$E[X_{t_1} X_{t_2}] = R_X(t_1 - t_2) \qquad t_1, t_2 \in \mathbb{T}.$$

The dual use of the notation $R_X$ if $X$ is WSS leads to the identity $R_X(t_1, t_2) = R_X(t_1 - t_2)$. As a practical matter, this means replacing a comma by a minus sign. Since one interpretation of $R_X$ requires it to have two arguments, and the other interpretation requires only one argument, the interpretation is clear from the number of arguments. Some brave authors even skip mentioning that $X$ is WSS when they write: "Suppose $(X_t : t \in \mathbb{R})$ has mean $\mu_X$ and correlation function $R_X(\tau)$," because it is implicit in this statement that $X$ is WSS.

Since the covariance function $C_X$ of a random process $X$ satisfies

$$C_X(t_1, t_2) = R_X(t_1, t_2) - \mu_X(t_1)\mu_X(t_2),$$

if $X$ is WSS then $C_X(t_1, t_2)$ is a function of $t_1 - t_2$. The notation $C_X$ is also used to denote the function of one variable such that $C_X(t_1 - t_2) = \mathrm{Cov}(X_{t_1}, X_{t_2})$. Therefore, if $X$ is WSS then $C_X(t_1 - t_2) = C_X(t_1, t_2)$. Also, $C_X(\tau) = R_X(\tau) - \mu_X^2$, where in this equation $\tau$ should be thought of as the difference of two times, $t_1 - t_2$.

In general, there is much more to know about a random vector or a random process than 
the first and second moments. Therefore, one can mathematically define WSS processes that 
are spectacularly different in appearance from any stationary random process. For example, any 
random process $(X_k : k \in \mathbb{Z})$ such that the $X_k$ are independent with $E[X_k] = 0$ and $\mathrm{Var}(X_k) = 1$ for all $k$ is WSS. To be specific, we could take the $X_k$ to be independent, with $X_k$ being $N(0,1)$ for $k \le 0$ and with $X_k$ having pmf

$$p_{X,1}(x,k) = P\{X_k = x\} = \begin{cases} \frac{1}{2k^2} & x \in \{k, -k\} \\ 1 - \frac{1}{k^2} & \text{if } x = 0 \\ 0 & \text{else} \end{cases}$$

for $k \ge 1$. A typical sample path of this WSS random process is shown in Figure 4.7.

Figure 4.7: A typical sample path.
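
That the pmf above has mean 0 and variance 1 for every k ≥ 1, even though the shape of the distribution changes drastically with k, can be verified with exact rational arithmetic. The helper names in this sketch are hypothetical.

```python
from fractions import Fraction

def pmf_k(k):
    """pmf of X_k for k >= 1: mass 1/(2k^2) at +k and at -k, the remaining mass at 0."""
    tail = Fraction(1, 2 * k * k)
    return {k: tail, -k: tail, 0: 1 - 2 * tail}

def mean_var(pmf):
    """Exact mean and variance of a finite pmf given as {value: probability}."""
    mean = sum(x * q for x, q in pmf.items())
    var = sum((x - mean) ** 2 * q for x, q in pmf.items())
    return mean, var

moments = [mean_var(pmf_k(k)) for k in range(1, 51)]
```

Every entry of `moments` is the pair (0, 1), while P{X_k ≠ 0} = 1/k² shrinks with k, which is why the sample path looks so different for large k than for k ≤ 0.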



The situation is much different if $X$ is a Gaussian process. Indeed, suppose $X$ is Gaussian and WSS. Then for any $t_1, t_2, \ldots, t_n, s \in \mathbb{T}$, the random vector $(X_{t_1+s}, X_{t_2+s}, \ldots, X_{t_n+s})^T$ is Gaussian with mean $(\mu, \mu, \ldots, \mu)^T$ and covariance matrix with ijth entry $C_X((t_i + s) - (t_j + s)) = C_X(t_i - t_j)$. This mean and covariance matrix do not depend on $s$. Thus, the distribution of the vector does not depend on $s$. Therefore, $X$ is stationary.

In summary, if X is stationary then X is WSS, and if X is both Gaussian and WSS, then X is 
stationary. 

Example 4.6.1 Let X t = Acos(uj c t + Q), where ui c is a nonzero constant, A and © are independent 
random variables with P{A > 0} = 1 and i?[^4 2 ] < +oo. Each sample path of the random process 
(Xt : t G K) is a pure sinusoidal function at frequency uj c radians per unit time, with amplitude A 
and phase G. 

We address two questions. First, what additional assumptions, if any, are needed on the distri- 
butions of A and 9 to imply that X is WSS? Second, we consider two distributions for 9 which 
each make X WSS, and see if they make X stationary. 

To address whether X is WSS, the mean and correlation functions can be computed as follows. 
Since A and 9 are independent and since cos(o; c t + 9) = cos(o; c t) cos(9) — sin(u; c t) sin(G), 

jj, x {t) = E[A) (E[cos(Q)) cos(uj c t) - £[sin(9)] sm(uj c t)) . 

Thus, the function /J,x(t) is a linear combination of cos(u; c i) and sin(cj c t). The only way such a 
linear combination can be independent of t is if the coefficients of both cos(u> c t) and sin(cu c i) are 
zero (in fact, it is enough to equate the values of fj,x(t) a t ^ct = 0, |, and it). Therefore, Hx(t) 
does not depend on t if and only if i?[cos(9)] = £'[sin(9)] = 0. 

Turning next to Rx, using the trigonometric identity cos(a) cos(b) = (cos(a — b) + cos(a + b))/2 
yields 

Rx(s,t) = E[A 2 ]E[cos(lo c s + &) cos(uj c t + &)} 

EL4 2 1 
= — [ — - {cos(uj c (s - t)) + E[cos{uj c (s + t) + 29)]} . 

Since s + 1 can be arbitrary for s — t fixed, in order that Rx(s, t) be a function of s — t alone it is 
necessary that E[cos(uj c (s + 1) + 29)] be a constant, independent of the value of s + 1. Arguing just 
as in the case of /j,x, with 9 replaced by 29, yields that Rx(s, t) is a function of s — t if and only 
if £[cos(29)] = £[sin(29)] = 0. 

Combining the findings for fix and Rx, yields that X is WSS, if and only if, 

£[cos(9)] = £[sin(9)] = £[cos(29)] = £[sin(29)] = 0. 

There are many distributions for 9 in [0, 2ir] such that the four moments specified are zero. Two 
possibilities are (a) 9 is uniformly distributed on the interval [0,27r], or, (b) 9 is a discrete random 
variable, taking the four values 0, f , 7T, ^f- with equal probability. Is X stationary for either 
possibility? 

We shall show that X is stationary if 9 is uniformly distributed over [0, 2tt\. Stationarity means 
that for any fixed constant s, the random processes (X t : t G M.) and (Xt +S :t£l) have the same 
finite order distributions. For this example, 

X t+ s = Acos(uj c (t + s) + 9) = Acos(uj c t + Q) 
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where G = ((w c s + 0) mod 2tt). By Example 1.4.4, G is again uniformly distributed on the interval 
[0, 27r]. Thus (A, Q) and (A, 0) have the same joint distribution, so Acos(uj c t+Q) and Acos(co c t+Q) 
have the same finite order distributions. Hence, X is indeed stationary if G is uniformly distributed 
over [0,27r]. 

Assume now that $\Theta$ takes on each of the values $0$, $\frac{\pi}{2}$, $\pi$, and $\frac{3\pi}{2}$ with equal probability. Is $X$
stationary? If $X$ were stationary then, in particular, $X_t$ would have the same distribution for all $t$.
On one hand, $P\{X_0 = 0\} = P\{\Theta = \frac{\pi}{2} \text{ or } \Theta = \frac{3\pi}{2}\} = \frac{1}{2}$. On the other hand, if $\omega_c t$ is not an integer
multiple of $\frac{\pi}{2}$, then $\omega_c t + \Theta$ cannot be an integer multiple of $\frac{\pi}{2}$, so $P\{X_t = 0\} = 0$. Hence $X$ is not
stationary.

(With more work it can be shown that $X$ is stationary if and only if $(\Theta \bmod 2\pi)$ is uniformly
distributed over the interval $[0, 2\pi]$.)
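
The contrast between wide-sense and strict stationarity for the four-point phase distribution can be checked numerically. The following is a minimal sketch, assuming the amplitude $A$ is nonzero with probability one; the helper names `expect` and `prob_x_is_zero` are illustrative and not part of the notes.

```python
import math

# Case (b): Theta takes the values 0, pi/2, pi, 3*pi/2 with equal probability.
thetas = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]
prob = 1.0 / len(thetas)

def expect(f):
    """Expectation of f(Theta) under the four-point distribution."""
    return sum(prob * f(th) for th in thetas)

# The four moments that must vanish for X to be WSS.
moments = {
    "E[cos(Theta)]": expect(math.cos),
    "E[sin(Theta)]": expect(math.sin),
    "E[cos(2 Theta)]": expect(lambda th: math.cos(2 * th)),
    "E[sin(2 Theta)]": expect(lambda th: math.sin(2 * th)),
}

def prob_x_is_zero(wct, tol=1e-9):
    """P{cos(wct + Theta) = 0}, which equals P{X_t = 0} when A is nonzero (assumption)."""
    return sum(prob for th in thetas if abs(math.cos(wct + th)) < tol)

p_at_0 = prob_x_is_zero(0.0)  # omega_c * t = 0
p_at_1 = prob_x_is_zero(1.0)  # omega_c * t = 1, not an integer multiple of pi/2
```

All four moments vanish up to rounding, so the process is WSS, while the probability that $X_t = 0$ changes with $t$, so it is not stationary.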



4.7 Joint properties of random processes 

Two random processes $X$ and $Y$ are said to be jointly stationary if their parameter set $\mathbb{T}$ is either
$\mathbb{Z}$ or $\mathbb{R}$, and if for any $t_1, \ldots, t_n, s \in \mathbb{T}$, the distribution of the random vector

$$(X_{t_1+s}, X_{t_2+s}, \ldots, X_{t_n+s}, Y_{t_1+s}, Y_{t_2+s}, \ldots, Y_{t_n+s})$$

does not depend on $s$.

The random processes X and Y are said to be jointly Gaussian if all the random variables 
comprising X and Y are jointly Gaussian. 

If $X$ and $Y$ are second order random processes on the same probability space, the cross correlation
function, $R_{XY}$, is defined by $R_{XY}(s,t) = E[X_s Y_t]$, and the cross covariance function, $C_{XY}$,
is defined by $C_{XY}(s,t) = \mathrm{Cov}(X_s, Y_t)$.

The random processes $X$ and $Y$ are said to be jointly WSS if $X$ and $Y$ are each WSS, and
if $R_{XY}(s,t)$ is a function of $s-t$. If $X$ and $Y$ are jointly WSS, we use $R_{XY}(\tau)$ for $R_{XY}(s,t)$
where $\tau = s-t$, and similarly $C_{XY}(s-t) = C_{XY}(s,t)$. Note that $C_{XY}(s,t) = C_{YX}(t,s)$, so
$C_{XY}(\tau) = C_{YX}(-\tau)$.

4.8 Conditional independence and Markov processes 

Markov processes are naturally associated with the state space approach for modeling a system. 
The idea of a state space model for a given system is to define the state of the system at any given 
time t. The state of the system at time t should summarize everything about the system up to and 
including time t that is relevant to the future of the system. For example, the state of an aircraft 
at time t could consist of the position, velocity, and remaining fuel at time t. Think of t as the 
present time. The state at time t determines the possible future part of the aircraft trajectory. For 
example, it determines how much longer the aircraft can fly and where it could possibly land. The 
state at time t does not completely determine the entire past trajectory of the aircraft. Rather, 
the state summarizes enough about the system up to the present so that if the state is known, 
no more information about the past is relevant to the future possibilities. The concept of state is
inherent in the Kalman filtering model discussed in Chapter 3. The notion of state is captured for
random processes using the notion of conditional independence and the Markov property, which
are discussed next.

Let $X$, $Y$, $Z$ be random vectors. We shall define the condition that $X$ and $Z$ are conditionally
independent given $Y$. This condition is denoted by $X - Y - Z$. If $X$, $Y$, $Z$ are discrete, then
$X - Y - Z$ is defined to hold if

$$P(X = i, Z = k \mid Y = j) = P(X = i \mid Y = j)\,P(Z = k \mid Y = j) \tag{4.6}$$

for all $i, j, k$ with $P\{Y = j\} > 0$. Equivalently, $X - Y - Z$ if

$$P\{X = i, Y = j, Z = k\}\,P\{Y = j\} = P\{X = i, Y = j\}\,P\{Z = k, Y = j\} \tag{4.7}$$

for all $i, j, k$. Equivalently again, $X - Y - Z$ if

$$P(Z = k \mid X = i, Y = j) = P(Z = k \mid Y = j) \tag{4.8}$$

for all $i, j, k$ with $P\{X = i, Y = j\} > 0$. The forms (4.6) and (4.7) make it clear that the condition
$X - Y - Z$ is symmetric in $X$ and $Z$: thus $X - Y - Z$ is the same condition as $Z - Y - X$. The
form (4.7) does not involve conditional probabilities, so no requirement about conditioning events
having positive probability is needed. The form (4.8) shows that $X - Y - Z$ means that knowing
$Y$ alone is as informative as knowing both $X$ and $Y$, for the purpose of determining conditional
probabilities of $Z$. Intuitively, the condition $X - Y - Z$ means that the random variable $Y$ serves
as a state.
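
Form (4.7) is convenient to test mechanically because it needs no division. The sketch below checks it for a joint pmf stored as a dictionary; the function name `is_cond_indep` and both example distributions are hypothetical illustrations, not taken from the notes.

```python
from itertools import product

def is_cond_indep(joint, tol=1e-12):
    """Check X - Y - Z via (4.7): P{X=i,Y=j,Z=k} P{Y=j} = P{X=i,Y=j} P{Z=k,Y=j}.

    `joint` maps (i, j, k) to P{X=i, Y=j, Z=k}; missing keys have probability 0.
    """
    xs = {i for i, _, _ in joint}
    ys = {j for _, j, _ in joint}
    zs = {k for _, _, k in joint}
    p = lambda i, j, k: joint.get((i, j, k), 0.0)
    p_y = {j: sum(p(i, j, k) for i in xs for k in zs) for j in ys}
    p_xy = {(i, j): sum(p(i, j, k) for k in zs) for i in xs for j in ys}
    p_yz = {(j, k): sum(p(i, j, k) for i in xs) for j in ys for k in zs}
    return all(
        abs(p(i, j, k) * p_y[j] - p_xy[(i, j)] * p_yz[(j, k)]) <= tol
        for i, j, k in product(xs, ys, zs)
    )

# Hypothetical example 1: a chain X -> Y -> Z of noisy binary copies.
flip = lambda a, b, eps: 1 - eps if a == b else eps
chain = {
    (i, j, k): 0.5 * flip(i, j, 0.1) * flip(j, k, 0.2)
    for i, j, k in product((0, 1), repeat=3)
}

# Hypothetical example 2: X, Y independent fair bits and Z = X xor Y.
xor = {(i, j, i ^ j): 0.25 for i, j in product((0, 1), repeat=2)}

chain_ok = is_cond_indep(chain)  # Y acts as a state, so (4.7) holds
xor_ok = is_cond_indep(xor)      # given Y, Z is determined by X, so (4.7) fails
```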

If $X$, $Y$, and $Z$ have a joint pdf, then the condition $X - Y - Z$ can be defined using the
pdfs and conditional pdfs in a similar way. For example, the conditional independence condition
$X - Y - Z$ holds by definition if

$$f_{XZ|Y}(x, z \mid y) = f_{X|Y}(x \mid y)\,f_{Z|Y}(z \mid y) \quad \text{whenever } f_Y(y) > 0.$$

An equivalent condition is

$$f_{Z|XY}(z \mid x, y) = f_{Z|Y}(z \mid y) \quad \text{whenever } f_{XY}(x, y) > 0. \tag{4.9}$$

Example 4.8.1 Suppose $X$, $Y$, $Z$ are jointly Gaussian vectors. Let us see what the condition
$X - Y - Z$ means in terms of the covariance matrices. Assume without loss of generality that the
vectors have mean zero. Because $X$, $Y$, and $Z$ are jointly Gaussian, the condition (4.9) is equivalent
to the condition that $E[Z \mid X, Y] = E[Z \mid Y]$ (because given $X$, $Y$, or just given $Y$, the conditional
distribution of $Z$ is Gaussian, and in the two cases the mean and covariance of the conditional
distribution of $Z$ are the same). The idea of linear innovations applied to the length two sequence
$(Y, X)$ yields $E[Z \mid X, Y] = E[Z \mid Y] + E[Z \mid \tilde{X}]$ where $\tilde{X} = X - E[X \mid Y]$. Thus $X - Y - Z$ if and only if
$E[Z \mid \tilde{X}] = 0$, or equivalently, if and only if $\mathrm{Cov}(\tilde{X}, Z) = 0$. Since $\tilde{X} = X - \mathrm{Cov}(X, Y)\mathrm{Cov}(Y)^{-1}Y$,
it follows that

$$\mathrm{Cov}(\tilde{X}, Z) = \mathrm{Cov}(X, Z) - \mathrm{Cov}(X, Y)\mathrm{Cov}(Y)^{-1}\mathrm{Cov}(Y, Z).$$

Therefore, $X - Y - Z$ if and only if

$$\mathrm{Cov}(X, Z) = \mathrm{Cov}(X, Y)\mathrm{Cov}(Y)^{-1}\mathrm{Cov}(Y, Z). \tag{4.10}$$

In particular, if $X$, $Y$, and $Z$ are jointly Gaussian random variables with nonzero variances, the
condition $X - Y - Z$ holds if and only if the correlation coefficients satisfy $\rho_{XZ} = \rho_{XY}\rho_{YZ}$.
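
Condition (4.10) can be checked on a scalar toy model. The sketch below is an illustration under assumed parameters, not part of the notes: $X \sim N(0,1)$, $Y = aX + V$, $Z = bY + cX + W$, with $V$ and $W$ independent zero-mean Gaussian noises of assumed variances 0.5 and 0.3. When $c = 0$, $Z$ depends on $X$ only through $Y$.

```python
import math

def second_moments(a, b, c, var_v=0.5, var_w=0.3):
    """Second moments for X ~ N(0,1), Y = a X + V, Z = b Y + c X + W.

    V and W are zero-mean Gaussian noises, independent of X and of each other.
    """
    var_x = 1.0
    cov_xy = a * var_x
    var_y = a * a * var_x + var_v
    cov_xz = b * cov_xy + c * var_x
    cov_yz = b * var_y + c * cov_xy
    var_z = b * b * var_y + 2 * b * c * cov_xy + c * c * var_x + var_w
    return var_x, var_y, var_z, cov_xy, cov_yz, cov_xz

def gap(a, b, c):
    """Cov(X,Z) - Cov(X,Y) Cov(Y)^{-1} Cov(Y,Z); zero exactly when (4.10) holds."""
    _, var_y, _, cov_xy, cov_yz, cov_xz = second_moments(a, b, c)
    return cov_xz - cov_xy * cov_yz / var_y

def rho_gap(a, b, c):
    """rho_XZ - rho_XY * rho_YZ, the correlation-coefficient form of the condition."""
    var_x, var_y, var_z, cov_xy, cov_yz, cov_xz = second_moments(a, b, c)
    rho = lambda cov, v1, v2: cov / math.sqrt(v1 * v2)
    return rho(cov_xz, var_x, var_z) - rho(cov_xy, var_x, var_y) * rho(cov_yz, var_y, var_z)

markov_gap = gap(2.0, -1.5, 0.0)  # Z depends on X only through Y
direct_gap = gap(2.0, -1.5, 0.7)  # Z also depends on X directly
```

In this model the gap works out to $c\,\mathrm{Var}(V)/\mathrm{Var}(Y)$, so it vanishes exactly when the direct dependence $c$ is zero.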

A general definition of conditional probabilities and conditional independence, based on the
general definition of conditional expectation given in Chapter 3, is given next. Recall that
$P(F) = E[I_F]$ for any event $F$, where $I_F$ denotes the indicator function of $F$. If $Y$ is a random
vector, we define $P(F \mid Y)$ to equal $E[I_F \mid Y]$. This means that $P(F \mid Y)$ is the unique (in the sense
that any two versions are equal with probability one) random variable such that

(1) $P(F \mid Y)$ is a function of $Y$ and it has finite second moments, and

(2) $E[g(Y)P(F \mid Y)] = E[g(Y)I_F]$ for any $g(Y)$ with finite second moment.

Given arbitrary random vectors, we define $X$ and $Z$ to be conditionally independent given $Y$
(written $X - Y - Z$) if for any Borel sets $A$ and $B$,

$$P(\{X \in A\}\{Z \in B\} \mid Y) = P(X \in A \mid Y)\,P(Z \in B \mid Y).$$

Equivalently, $X - Y - Z$ holds if for any Borel set $B$, $P(Z \in B \mid X, Y) = P(Z \in B \mid Y)$.

Definition 4.8.2 A random process $X = (X_t : t \in \mathbb{T})$ is said to be a Markov process if for any
$t_1, \ldots, t_{n+1}$ in $\mathbb{T}$ with $t_1 < \cdots < t_{n+1}$, the following conditional independence condition holds:

$$(X_{t_1}, \ldots, X_{t_n}) - X_{t_n} - X_{t_{n+1}} \tag{4.11}$$

It turns out that the Markov property is equivalent to the following conditional independence
property: For any $t_1, \ldots, t_{n+m}$ in $\mathbb{T}$ with $t_1 < \cdots < t_{n+m}$,

$$(X_{t_1}, \ldots, X_{t_n}) - X_{t_n} - (X_{t_n}, \ldots, X_{t_{n+m}}) \tag{4.12}$$

The definition (4.11) is easier to check than condition (4.12), but (4.12) is appealing because it is
symmetric in time. In words, thinking of $t_n$ as the present time, the Markov property means that
the past and future of $X$ are conditionally independent given the present state $X_{t_n}$.



Example 4.8.3 (Markov property of independent increment processes) Let $(X_t : t \geq 0)$ be an
independent increment process such that $X_0$ is a constant. Then for any $t_1, \ldots, t_{n+1}$ with
$0 \leq t_1 \leq \cdots \leq t_{n+1}$, the vector $(X_{t_1}, \ldots, X_{t_n})$ is a function of the $n$ increments
$X_{t_1} - X_0, X_{t_2} - X_{t_1}, \ldots, X_{t_n} - X_{t_{n-1}}$, and is thus independent of the increment $V = X_{t_{n+1}} - X_{t_n}$.
But $X_{t_{n+1}}$ is determined by $V$ and $X_{t_n}$. Thus, $X$ is a Markov process. In particular, random walks,
Brownian motions, and Poisson processes are Markov processes.

Example 4.8.4 (Gaussian Markov processes) Suppose $X = (X_t : t \in \mathbb{T})$ is a Gaussian random
process with $\mathrm{Var}(X_t) > 0$ for all $t$. By the characterization of conditional independence for jointly
Gaussian vectors (4.10), the Markov property (4.11) is equivalent to

$$\mathrm{Cov}\left(\begin{pmatrix} X_{t_1} \\ X_{t_2} \\ \vdots \\ X_{t_n} \end{pmatrix}, X_{t_{n+1}}\right)
= \mathrm{Cov}\left(\begin{pmatrix} X_{t_1} \\ X_{t_2} \\ \vdots \\ X_{t_n} \end{pmatrix}, X_{t_n}\right)
\mathrm{Var}(X_{t_n})^{-1}\,\mathrm{Cov}(X_{t_n}, X_{t_{n+1}})$$

which, letting $\rho(s,t)$ denote the correlation coefficient between $X_s$ and $X_t$, is equivalent to the
requirement

$$\begin{pmatrix} \rho(t_1, t_{n+1}) \\ \rho(t_2, t_{n+1}) \\ \vdots \\ \rho(t_n, t_{n+1}) \end{pmatrix}
= \begin{pmatrix} \rho(t_1, t_n) \\ \rho(t_2, t_n) \\ \vdots \\ \rho(t_n, t_n) \end{pmatrix} \rho(t_n, t_{n+1}).$$

Therefore a Gaussian process $X$ is Markovian if and only if

$$\rho(r, t) = \rho(r, s)\rho(s, t) \quad \text{whenever } r, s, t \in \mathbb{T} \text{ with } r < s < t. \tag{4.13}$$



If $X = (X_k : k \in \mathbb{Z})$ is a discrete-time stationary Gaussian process, then $\rho(s,t)$ may be written
as $\rho(k)$, where $k = s - t$. Note that $\rho(k) = \rho(-k)$. Such a process is Markovian if and only if
$\rho(k_1 + k_2) = \rho(k_1)\rho(k_2)$ for all positive integers $k_1$ and $k_2$. Therefore, $X$ is Markovian if and only if
$\rho(k) = b^{|k|}$ for all $k$, for some constant $b$ with $|b| \leq 1$. Equivalently, a stationary Gaussian process
$X = (X_k : k \in \mathbb{Z})$ with $\mathrm{Var}(X_k) > 0$ for all $k$ is Markovian if and only if the covariance function
has the form $C_X(k) = Ab^{|k|}$ for some constants $A$ and $b$ with $A > 0$ and $|b| \leq 1$.

Similarly, if $(X_t : t \in \mathbb{R})$ is a continuous-time stationary Gaussian process with $\mathrm{Var}(X_t) > 0$
for all $t$, $X$ is Markovian if and only if $\rho(s+t) = \rho(s)\rho(t)$ for all $s, t \geq 0$. The only bounded
real-valued functions satisfying such a multiplicative condition are exponential functions. Therefore, a
stationary Gaussian process $X$ with $\mathrm{Var}(X_t) > 0$ for all $t$ is Markovian if and only if $\rho$ has the
form $\rho(\tau) = \exp(-\alpha|\tau|)$ for some constant $\alpha \geq 0$, or equivalently, if and only if $C_X$ has the form
$C_X(\tau) = A\exp(-\alpha|\tau|)$ for some constants $A > 0$ and $\alpha \geq 0$.
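
The multiplicative condition on $\rho$ is easy to test for a candidate correlation function. The sketch below, with hypothetical names and parameter values, compares a geometric correlation $b^{|k|}$ with a moving-average style correlation that is nonzero only at lags $0$ and $\pm 1$.

```python
def is_markov_correlation(rho, max_lag=6, tol=1e-12):
    """Check rho(k1 + k2) = rho(k1) rho(k2) for positive integer lags up to max_lag."""
    return all(
        abs(rho(k1 + k2) - rho(k1) * rho(k2)) <= tol
        for k1 in range(1, max_lag + 1)
        for k2 in range(1, max_lag + 1)
    )

b = 0.6  # assumed value with |b| <= 1
geometric = lambda k: b ** abs(k)                   # rho(k) = b^{|k|}
ma1 = lambda k: {0: 1.0, 1: 0.4}.get(abs(k), 0.0)   # rho(1) = 0.4, zero beyond lag 1

geometric_ok = is_markov_correlation(geometric)
ma1_ok = is_markov_correlation(ma1)
```

The second function fails because $\rho(2) = 0$ while $\rho(1)^2 = 0.16$, so a stationary Gaussian process with that correlation is not Markovian.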



The following proposition should be intuitively clear, and it often applies in practice. 

Proposition 4.8.5 (Markov property of a sequence determined by a recursion driven by independent
random variables) Suppose $X_0, U_1, U_2, \ldots$ are mutually independent random variables and
suppose $(X_n : n \geq 1)$ is determined by a recursion of the form $X_{n+1} = h_{n+1}(X_n, U_{n+1})$ for $n \geq 0$.
Then $(X_n : n \geq 0)$ is a Markov process.

Proof. Let $n \geq 1$, let $B$ be an arbitrary Borel subset of $\mathbb{R}$, and let $\phi$ be the function defined by
$\phi(x_n) = P\{h_{n+1}(x_n, U_{n+1}) \in B\}$. Let $g$ be an arbitrary Borel measurable function of $n$ variables
such that $g(X_1, \ldots, X_n)$ has a finite second moment. Then

$$\begin{aligned}
E\left[I_{\{X_{n+1} \in B\}}\,g(X_1, \ldots, X_n)\right]
&= \int_{\mathbb{R}^n} \int_{\{u : h_{n+1}(x_n, u) \in B\}} g(x_1, \ldots, x_n)\,dF_{U_{n+1}}(u)\,dF_{X_1, \ldots, X_n}(x_1, \ldots, x_n) \\
&= \int_{\mathbb{R}^n} \left(\int_{\{u : h_{n+1}(x_n, u) \in B\}} dF_{U_{n+1}}(u)\right) g(x_1, \ldots, x_n)\,dF_{X_1, \ldots, X_n}(x_1, \ldots, x_n) \\
&= \int_{\mathbb{R}^n} \phi(x_n)\,g(x_1, \ldots, x_n)\,dF_{X_1, \ldots, X_n}(x_1, \ldots, x_n) \\
&= E[\phi(X_n)\,g(X_1, \ldots, X_n)].
\end{aligned}$$

Therefore, $P(X_{n+1} \in B \mid X_1, \ldots, X_n) = \phi(X_n)$. In particular, $P(X_{n+1} \in B \mid X_1, \ldots, X_n)$ is a function
of $X_n$, so that $P(X_{n+1} \in B \mid X_1, \ldots, X_n) = P(X_{n+1} \in B \mid X_n)$. Since $B$ is arbitrary, this implies
$(X_1, \ldots, X_n) - X_n - X_{n+1}$, so $(X_n : n \geq 0)$ is a Markov process. ∎

For example, if the driving terms $w_k : k \geq 0$ used for discrete-time Kalman filtering are
independent (rather than just being pairwise uncorrelated), then the state process of the Kalman
filtering model has the Markov property.
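
The conclusion of Proposition 4.8.5 can be checked exactly on a small finite example by enumerating all outcomes. The recursion `h`, the state set $\{0, 1, 2\}$, and the distributions of $X_0$ and $U_k$ below are hypothetical choices made only for illustration.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

def h(x, u):
    """Hypothetical update rule: X_{n+1} = (X_n + U_{n+1}) mod 3."""
    return (x + u) % 3

# X_0 uniform on {0, 1, 2}; U_1, U_2, U_3 i.i.d. with P{U = 1} = 1/3, P{U = 0} = 2/3.
p_x0 = {x: Fraction(1, 3) for x in range(3)}
p_u = {0: Fraction(2, 3), 1: Fraction(1, 3)}

# Exact joint pmf of (X_1, X_2, X_3), obtained by enumerating (X_0, U_1, U_2, U_3).
joint = defaultdict(Fraction)
for x0, u1, u2, u3 in product(p_x0, p_u, p_u, p_u):
    x1 = h(x0, u1)
    x2 = h(x1, u2)
    x3 = h(x2, u3)
    joint[(x1, x2, x3)] += p_x0[x0] * p_u[u1] * p_u[u2] * p_u[u3]

def cond_x3_given(x1, x2, x3):
    """P(X_3 = x3 | X_1 = x1, X_2 = x2), or None if the conditioning event is null."""
    denom = sum(joint[(x1, x2, k)] for k in range(3))
    return joint[(x1, x2, x3)] / denom if denom else None

def phi(x2, x3):
    """phi(x2) = P{h(x2, U) = x3}, as in the proof."""
    return sum(p for u, p in p_u.items() if h(x2, u) == x3)

# Markov property: the conditional pmf of X_3 depends on (X_1, X_2) only through X_2.
markov = all(
    cond_x3_given(x1, x2, x3) in (None, phi(x2, x3))
    for x1, x2, x3 in product(range(3), repeat=3)
)
```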

4.9 Discrete-state Markov processes 

This section delves further into the theory of Markov processes in the technically simplest case of
a discrete state space. Let $\mathcal{S}$ be a finite or countably infinite set, called the state space. Given a
probability space $(\Omega, \mathcal{F}, P)$, an $\mathcal{S}$ valued random variable is defined to be a function $Y$ mapping $\Omega$
to $\mathcal{S}$ such that $\{\omega : Y(\omega) = s\} \in \mathcal{F}$ for each $s \in \mathcal{S}$. Assume that the elements of $\mathcal{S}$ are ordered
so that $\mathcal{S} = \{a_1, a_2, \ldots, a_n\}$ in case $\mathcal{S}$ has finite cardinality, or $\mathcal{S} = \{a_1, a_2, a_3, \ldots\}$ in case $\mathcal{S}$ has
infinite cardinality. Given the ordering, an $\mathcal{S}$ valued random variable is equivalent to a positive
integer valued random variable, so it is nothing exotic. Think of the probability distribution of
an $\mathcal{S}$ valued random variable $Y$ as a row vector of possibly infinite dimension, called a probability
vector: $p_Y = (P\{Y = a_1\}, P\{Y = a_2\}, \ldots)$. Similarly think of a deterministic function $g$ on $\mathcal{S}$ as
a column vector, $g = (g(a_1), g(a_2), \ldots)^T$. Since the elements of $\mathcal{S}$ may not even be numbers, it
might not make sense to speak of the expected value of an $\mathcal{S}$ valued random variable. However,
if $g$ is a function mapping $\mathcal{S}$ to the reals, then $g(Y)$ is a real-valued random variable and its
expectation is given by the inner product of the probability vector $p_Y$ and the column vector $g$:
$E[g(Y)] = \sum_{i \in \mathcal{S}} p_Y(i)g(i) = p_Y g$. A random process $X = (X_t : t \in \mathbb{T})$ is said to have state space
$\mathcal{S}$ if $X_t$ is an $\mathcal{S}$ valued random variable for each $t \in \mathbb{T}$, and the Markov property of such a random
process is defined just as it is for a real valued random process.

Let $(X_t : t \in \mathbb{T})$ be a Markov process with state space $\mathcal{S}$. For brevity we denote the first
order pmf of $X$ at time $t$ as $\pi(t) = (\pi_i(t) : i \in \mathcal{S})$. That is, $\pi_i(t) = p_X(i, t) = P\{X(t) = i\}$. The
following notation is used to denote conditional probabilities:

$$P(X_{t_1} = j_1, \ldots, X_{t_n} = j_n \mid X_{s_1} = i_1, \ldots, X_{s_m} = i_m) = p_X(j_1, t_1; \ldots; j_n, t_n \mid i_1, s_1; \ldots; i_m, s_m)$$

For brevity, conditional probabilities of the form $P(X_t = j \mid X_s = i)$ are written as $p_{ij}(s,t)$, and are
called the transition probabilities of $X$.

The first order pmfs $\pi(t)$ and the transition probabilities $p_{ij}(s,t)$ determine all the finite order
distributions of the Markov process as follows. Given

$$t_1 < t_2 < \cdots < t_n \text{ in } \mathbb{T}, \qquad i_1, i_2, \ldots, i_n \in \mathcal{S} \tag{4.14}$$

one writes

$$\begin{aligned}
p_X(i_1, t_1; \cdots; i_n, t_n)
&= p_X(i_1, t_1; \cdots; i_{n-1}, t_{n-1})\,p_X(i_n, t_n \mid i_1, t_1; \cdots; i_{n-1}, t_{n-1}) \\
&= p_X(i_1, t_1; \cdots; i_{n-1}, t_{n-1})\,p_{i_{n-1} i_n}(t_{n-1}, t_n)
\end{aligned}$$

Application of this operation $n - 2$ more times yields that

$$p_X(i_1, t_1; \cdots; i_n, t_n) = \pi_{i_1}(t_1)\,p_{i_1 i_2}(t_1, t_2) \cdots p_{i_{n-1} i_n}(t_{n-1}, t_n) \tag{4.15}$$

which shows that the finite order distributions of $X$ are indeed determined by the first order pmfs
and the transition probabilities. Equation (4.15) can be used to easily verify that the form (4.12)
of the Markov property holds.

Given $s < t$, the collection $H(s,t)$ defined by $H(s,t) = (p_{ij}(s,t) : i, j \in \mathcal{S})$ should be thought
of as a matrix, and it is called the transition probability matrix for the interval $[s,t]$. Let $e$ denote
the column vector with all ones, indexed by $\mathcal{S}$. Since $\pi(t)$ and the rows of $H(s,t)$ are probability
vectors, it follows that $\pi(t)e = 1$ and $H(s,t)e = e$. Computing the distribution of $X_t$ by summing
over all possible values of $X_s$ yields that $\pi_j(t) = \sum_i P(X_s = i, X_t = j) = \sum_i \pi_i(s)p_{ij}(s,t)$, which in
matrix form yields that $\pi(t) = \pi(s)H(s,t)$ for $s, t \in \mathbb{T}$, $s \leq t$. Similarly, given $s < \tau < t$, computing
the conditional distribution of $X_t$ given $X_s$ by summing over all possible values of $X_\tau$ yields

$$H(s,t) = H(s,\tau)H(\tau,t) \qquad s, \tau, t \in \mathbb{T},\ s < \tau < t. \tag{4.16}$$

The relations (4.16) are known as the Chapman-Kolmogorov equations.

A Markov process is time-homogeneous if the transition probabilities $p_{ij}(s,t)$ depend on $s$ and
$t$ only through $t - s$. In that case we write $p_{ij}(t-s)$ instead of $p_{ij}(s,t)$, and $H_{ij}(t-s)$ instead of
$H_{ij}(s,t)$. If the Markov process is time-homogeneous, then $\pi(s+\tau) = \pi(s)H(\tau)$ for $s, s+\tau \in \mathbb{T}$ and
$\tau \geq 0$. A probability distribution $\pi$ is called an equilibrium (or invariant) distribution if $\pi H(\tau) = \pi$
for all $\tau \geq 0$.

Recall that a random process is stationary if its finite order distributions are invariant with
respect to translation in time. On one hand, referring to (4.15), we see that a time-homogeneous
Markov process is stationary if and only if $\pi(t) = \pi$ for all $t$ for some equilibrium distribution $\pi$.
On the other hand, a Markov random process that is stationary is time-homogeneous.

Repeated application of the Chapman-Kolmogorov equations yields that $p_{ij}(s,t)$ can be
expressed in terms of transition probabilities for $s$ and $t$ close together. For example, consider
Markov processes with index set the integers. Then $H(n, k+1) = H(n, k)P(k)$ for $n \leq k$, where
$P(k) = H(k, k+1)$ is the one-step transition probability matrix. Fixing $n$ and using forward
recursion starting with $H(n, n) = I$, $H(n, n+1) = P(n)$, $H(n, n+2) = P(n)P(n+1)$, and so forth
yields

$$H(n, l) = P(n)P(n+1) \cdots P(l-1)$$

In particular, if the chain is time-homogeneous then $H(k) = P^k$ for all $k$, where $P$ is the
time-independent one-step transition probability matrix, and $\pi(l) = \pi(k)P^{l-k}$ for $l \geq k$. In this case a
probability distribution $\pi$ is an equilibrium distribution if and only if $\pi P = \pi$.
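
The forward recursion and the Chapman-Kolmogorov equations can be illustrated with a small time-varying chain. In the sketch below the two-state one-step matrices returned by `one_step` are hypothetical; only the product formula for $H(n, l)$ comes from the text.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def one_step(k):
    """Hypothetical time-varying one-step matrix P(k) on two states."""
    eps = 1.0 / (k + 2)
    return [[1.0 - eps, eps], [0.5, 0.5]]

def H(n, l):
    """H(n, l) = P(n) P(n+1) ... P(l-1), with H(n, n) = I."""
    out = identity(2)
    for k in range(n, l):
        out = matmul(out, one_step(k))
    return out

lhs = H(0, 5)
rhs = matmul(H(0, 2), H(2, 5))  # Chapman-Kolmogorov with intermediate time 2
```

Up to rounding, `lhs` and `rhs` agree entry by entry, and each row of `H(0, 5)` sums to one.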



Example 4.9.1 Consider a two-stage pipeline through which packets flow, as pictured in Figure
4.8. Some assumptions about the pipeline will be made in order to model it as a simple
discrete-time Markov process. Each stage has a single buffer. Normalize time so that in one unit of time
a packet can make a single transition. Call the time interval between $k$ and $k+1$ the $k$th "time
slot," and assume that the pipeline evolves in the following way during a given slot.

Figure 4.8: A two-stage pipeline.

If at the beginning of the slot, there are no packets in stage one, then a new packet arrives to stage
one with probability $a$, independently of the past history of the pipeline and of the outcome
at stage two.

If at the beginning of the slot, there is a packet in stage one and no packet in stage two, then the
packet is transferred to stage two with probability $d_1$.

If at the beginning of the slot, there is a packet in stage two, then the packet departs from the
stage and leaves the system with probability $d_2$, independently of the state or outcome of
stage one.

These assumptions lead us to model the pipeline as a discrete-time Markov process with the state
space $\mathcal{S} = \{00, 01, 10, 11\}$, transition probability diagram shown in Figure 4.9 (using the notation
$\bar{x} = 1 - x$) and one-step transition probability matrix $P$ given by

$$P = \begin{pmatrix}
\bar{a} & 0 & a & 0 \\
\bar{a}d_2 & \bar{a}\bar{d}_2 & ad_2 & a\bar{d}_2 \\
0 & d_1 & \bar{d}_1 & 0 \\
0 & 0 & d_2 & \bar{d}_2
\end{pmatrix}$$

The rows of $P$ are probability vectors. For example, the first row is the probability distribution of
the state at the end of a slot, given that the state is 00 at the beginning of a slot. Now that the
model is specified, let us determine the throughput rate of the pipeline.

Figure 4.9: One-step transition probability diagram.

The equilibrium probability distribution $\pi = (\pi_{00}, \pi_{01}, \pi_{10}, \pi_{11})$ is the probability vector
satisfying the linear equation $\pi = \pi P$. Once $\pi$ is found, the throughput rate $\eta$ can be computed as
follows. It is defined to be the rate (averaged over a long time) that packets transit the pipeline.
Since at most two packets can be in the pipeline at a time, the following three quantities are all
clearly the same, and can be taken to be the throughput rate.

The rate of arrivals to stage one

The rate of departures from stage one (or rate of arrivals to stage two)

The rate of departures from stage two

Focus on the first of these three quantities to obtain

$$\begin{aligned}
\eta &= P\{\text{an arrival at stage 1}\} \\
&= P(\text{an arrival at stage 1} \mid \text{stage 1 empty at slot beginning})\,P(\text{stage 1 empty at slot beginning}) \\
&= a(\pi_{00} + \pi_{01}).
\end{aligned}$$

Similarly, by focusing on departures from stage 1, obtain $\eta = d_1\pi_{10}$. Finally, by focusing on
departures from stage 2, obtain $\eta = d_2(\pi_{01} + \pi_{11})$. These three expressions for $\eta$ must agree.

Consider the numerical example $a = d_1 = d_2 = 0.5$. The equation $\pi = \pi P$ yields that $\pi$ is
proportional to the vector $(1, 2, 3, 1)$. Applying the fact that $\pi$ is a probability distribution yields
that $\pi = (1/7, 2/7, 3/7, 1/7)$. Therefore $\eta = 3/14 = 0.214\ldots$.
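
The numerical example can be reproduced exactly with rational arithmetic. The sketch below assumes the state ordering 00, 01, 10, 11 and the pipeline rules described in the example; the function names and the Gauss-Jordan solver are illustrative choices, not the method of the notes.

```python
from fractions import Fraction

def pipeline_matrix(a, d1, d2):
    """One-step matrix P for states ordered 00, 01, 10, 11 (na, nd1, nd2 are 1 - a, etc.)."""
    na, nd1, nd2 = 1 - a, 1 - d1, 1 - d2
    return [
        [na,      0,        a,      0],
        [na * d2, na * nd2, a * d2, a * nd2],
        [0,       d1,       nd1,    0],
        [0,       0,        d2,     nd2],
    ]

def equilibrium(P):
    """Solve pi = pi P with sum(pi) = 1 exactly by Gauss-Jordan elimination."""
    n = len(P)
    # Rows 0..n-2: balance equations sum_i pi_i (P[i][j] - I[i][j]) = 0 for column j.
    A = [[Fraction(P[i][j]) - (1 if i == j else 0) for i in range(n)] + [Fraction(0)]
         for j in range(n - 1)]
    # Last row: normalization sum_i pi_i = 1 (replaces the redundant balance equation).
    A.append([Fraction(1)] * n + [Fraction(1)])
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                A[r] = [vr - A[r][col] * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]

half = Fraction(1, 2)
a, d1, d2 = half, half, half
pi = equilibrium(pipeline_matrix(a, d1, d2))
eta_arrivals = a * (pi[0] + pi[1])   # a (pi_00 + pi_01)
eta_stage1 = d1 * pi[2]              # d1 pi_10
eta_stage2 = d2 * (pi[1] + pi[3])    # d2 (pi_01 + pi_11)
```

The three throughput expressions agree, as the example says they must.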



In the remainder of this section we assume that $X$ is a continuous-time, finite-state Markov
process. The transition probabilities for arbitrary time intervals can be described in terms of the
transition probabilities over arbitrarily short time intervals. By saving only a linearization of the
transition probabilities, the concept of generator matrix arises naturally, as we describe next.

Let $\mathcal{S}$ be a finite set. A pure-jump function for a finite state space $\mathcal{S}$ is a function $x : \mathbb{R}_+ \to \mathcal{S}$
such that there is a sequence of times, $0 = \tau_0 < \tau_1 < \cdots$ with $\lim_{i \to \infty} \tau_i = \infty$, and a sequence of
states, $s_0, s_1, \ldots$ with $s_i \neq s_{i+1}$, $i \geq 0$, such that $x(t) = s_i$ for $\tau_i \leq t < \tau_{i+1}$. A pure-jump Markov process
is an $\mathcal{S}$ valued Markov process such that, with probability one, the sample functions are pure-jump
functions.

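Let $Q = (q_{ij} : i, j \in \mathcal{S})$ be such that

$$\begin{aligned}
q_{ij} &\geq 0 && i, j \in \mathcal{S},\ i \neq j \\
q_{ii} &= -\sum_{j \in \mathcal{S},\, j \neq i} q_{ij} && i \in \mathcal{S}.
\end{aligned} \tag{4.17}$$

An example for state space $\mathcal{S} = \{1, 2, 3\}$ is

$$Q = \begin{pmatrix} -1 & 0.5 & 0.5 \\ 1 & -2 & 1 \\ 0 & 1 & -1 \end{pmatrix}$$

and this matrix $Q$ can be represented by the transition rate diagram shown in Figure 4.10.

Figure 4.10: Transition rate diagram for a continuous-time Markov process.

A pure-jump, time-homogeneous Markov process $X$ has generator matrix $Q$ if the transition probabilities
$(p_{ij}(\tau))$ satisfy

$$\lim_{h \searrow 0} \frac{p_{ij}(h) - I_{\{i=j\}}}{h} = q_{ij} \qquad i, j \in \mathcal{S} \tag{4.18}$$

or equivalently

$$p_{ij}(h) = I_{\{i=j\}} + hq_{ij} + o(h) \qquad i, j \in \mathcal{S} \tag{4.19}$$

where $o(h)$ represents a quantity such that $\lim_{h \to 0} o(h)/h = 0$. For the example this means that
the transition probability matrix for a time interval of duration $h$ is given by

$$\begin{pmatrix} 1-h & 0.5h & 0.5h \\ h & 1-2h & h \\ 0 & h & 1-h \end{pmatrix}
+ \begin{pmatrix} o(h) & o(h) & o(h) \\ o(h) & o(h) & o(h) \\ o(h) & o(h) & o(h) \end{pmatrix}$$

For small enough $h$, the rows of the first matrix are probability distributions, owing to the
assumptions on the generator matrix $Q$.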

Proposition 4.9.2 Given a matrix $Q$ satisfying (4.17), and a probability distribution
$\pi(0) = (\pi_i(0) : i \in \mathcal{S})$, there is a pure-jump, time-homogeneous Markov process with generator
matrix $Q$ and initial distribution $\pi(0)$. The finite order distributions of the process are uniquely
determined by $\pi(0)$ and $Q$.

The first order distributions and the transition probabilities can be derived from $Q$ and an
initial distribution $\pi(0)$ by solving differential equations, derived as follows. Fix $t > 0$ and let $h$ be
a small positive number. The Chapman-Kolmogorov equations imply that

$$\frac{\pi_j(t+h) - \pi_j(t)}{h} = \sum_{i \in \mathcal{S}} \pi_i(t)\left(\frac{p_{ij}(h) - I_{\{i=j\}}}{h}\right). \tag{4.20}$$

Letting $h$ converge to zero yields the differential equation:

$$\frac{\partial \pi_j(t)}{\partial t} = \sum_{i \in \mathcal{S}} \pi_i(t)q_{ij} \tag{4.21}$$

or, in matrix notation, $\frac{\partial \pi(t)}{\partial t} = \pi(t)Q$. These equations, known as the Kolmogorov forward
equations, can be rewritten as

$$\frac{\partial \pi_j(t)}{\partial t} = \sum_{i \in \mathcal{S},\, i \neq j} \pi_i(t)q_{ij} - \sum_{i \in \mathcal{S},\, i \neq j} \pi_j(t)q_{ji}, \tag{4.22}$$

which shows that the rate of change of the probability of being at state $j$ is the rate of probability flow
into state $j$ minus the rate of probability flow out of state $j$.

The Kolmogorov forward equations (4.21), or equivalently, (4.22), for $(\pi(t) : t \geq 0)$ take as
input data the initial distribution $\pi(0)$ and the generator matrix $Q$. These equations include as
special cases differential equations for the transition probability functions, $p_{ij}(t)$. After all, for $i_o$
fixed, $p_{i_o,j}(t) = P(X_t = j \mid X_0 = i_o) = \pi_j(t)$ if the initial distribution of $(\pi(t))$ is $\pi_i(0) = I_{\{i = i_o\}}$.
Thus, (4.21) specializes to

$$\frac{\partial p_{i_o,j}(t)}{\partial t} = \sum_{i \in \mathcal{S}} p_{i_o,i}(t)q_{ij} \qquad p_{i_o,i}(0) = I_{\{i = i_o\}} \tag{4.23}$$

Recall that $H(t)$ is the matrix with $(i,j)$ element equal to $p_{ij}(t)$. Therefore, for any $i_o$ fixed,
the differential equation (4.23) determines the $i_o$th row of $(H(t); t \geq 0)$. The equations (4.23)
for all choices of $i_o$ can be written together in the following matrix form: $\frac{\partial H(t)}{\partial t} = H(t)Q$ with
$H(0)$ equal to the identity matrix. An occasionally useful general expression for the solution is

$$H(t) = \exp(Qt) \triangleq \sum_{n=0}^{\infty} \frac{t^n Q^n}{n!}$$
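
The series for $\exp(Qt)$ can be summed directly for a small generator. The sketch below truncates the series after a fixed number of terms (an assumption that is adequate only when the entries of $Qt$ are modest) and uses the three-state generator $Q$ with rows $(-1, 0.5, 0.5)$, $(1, -2, 1)$, $(0, 1, -1)$, taken here as the matrix of the earlier three-state example.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def expm_series(Q, t, terms=60):
    """Approximate exp(Qt) = sum_{n>=0} (Qt)^n / n! by truncating the series."""
    n = len(Q)
    Qt = [[q * t for q in row] for row in Q]
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # (Qt)^0 / 0!
    total = [row[:] for row in term]
    for k in range(1, terms):
        # (Qt)^k / k! obtained from the previous term by one multiplication.
        term = [[v / k for v in row] for row in matmul(term, Qt)]
        total = [[s + v for s, v in zip(rs, rv)] for rs, rv in zip(total, term)]
    return total

Q = [[-1.0, 0.5, 0.5], [1.0, -2.0, 1.0], [0.0, 1.0, -1.0]]
H1 = expm_series(Q, 1.0)       # H(1)
Hh = expm_series(Q, 1e-3)      # H(h) for small h, close to I + hQ
half = expm_series(Q, 0.5)
H1_from_halves = matmul(half, half)  # Chapman-Kolmogorov: H(0.5) H(0.5) = H(1)
```

Each row of `H1` is a probability vector, and `Hh` differs from $I + hQ$ by terms of order $h^2$, consistent with (4.19).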



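Example 4.9.3 Consider the two-state, continuous-time Markov process with the transition rate
diagram shown in Figure 4.11 for some positive constants $\alpha$ and $\beta$. The generator matrix is given
by

$$Q = \begin{pmatrix} -\alpha & \alpha \\ \beta & -\beta \end{pmatrix}$$

Figure 4.11: Transition rate diagram for a two-state continuous-time Markov process.

Let us solve the forward Kolmogorov equation for a given initial distribution $\pi(0)$. The equation
for $\pi_1(t)$ is

$$\frac{\partial \pi_1(t)}{\partial t} = -\alpha\pi_1(t) + \beta\pi_2(t); \qquad \pi_1(0) \text{ given}$$

But $\pi_1(t) = 1 - \pi_2(t)$, so

$$\frac{\partial \pi_1(t)}{\partial t} = -(\alpha + \beta)\pi_1(t) + \beta; \qquad \pi_1(0) \text{ given}$$

By differentiation we check that this equation has the solution

$$\begin{aligned}
\pi_1(t) &= \pi_1(0)e^{-(\alpha+\beta)t} + \int_0^t e^{-(\alpha+\beta)(t-s)}\beta\,ds \\
&= \pi_1(0)e^{-(\alpha+\beta)t} + \frac{\beta}{\alpha+\beta}\left(1 - e^{-(\alpha+\beta)t}\right)
\end{aligned}$$

so that

$$\pi(t) = \pi(0)e^{-(\alpha+\beta)t} + \left(\frac{\beta}{\alpha+\beta}, \frac{\alpha}{\alpha+\beta}\right)\left(1 - e^{-(\alpha+\beta)t}\right) \tag{4.24}$$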

For any initial distribution $\pi(0)$,

$$\lim_{t \to \infty} \pi(t) = \left(\frac{\beta}{\alpha+\beta}, \frac{\alpha}{\alpha+\beta}\right).$$

The rate of convergence is exponential, with rate parameter $\alpha + \beta$, and the limiting distribution is
the unique probability distribution satisfying $\pi Q = 0$.

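By specializing (4.24) we determine $H(t)$. Specifically, $H(t)$ is a $2 \times 2$ matrix; its top row is $\pi(t)$
for the initial condition $\pi(0) = (1, 0)$; its bottom row is $\pi(t)$ for the initial condition $\pi(0) = (0, 1)$;
the result is:

$$H(t) = \begin{pmatrix}
\dfrac{\alpha e^{-(\alpha+\beta)t} + \beta}{\alpha+\beta} & \dfrac{\alpha\left(1 - e^{-(\alpha+\beta)t}\right)}{\alpha+\beta} \\[2ex]
\dfrac{\beta\left(1 - e^{-(\alpha+\beta)t}\right)}{\alpha+\beta} & \dfrac{\alpha + \beta e^{-(\alpha+\beta)t}}{\alpha+\beta}
\end{pmatrix}. \tag{4.25}$$

Note that $H(t)$ is a transition probability matrix for each $t \geq 0$, $H(0)$ is the $2 \times 2$ identity matrix;
each row of $\lim_{t \to \infty} H(t)$ is equal to $\lim_{t \to \infty} \pi(t)$.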
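
The closed form (4.25) can be sanity-checked numerically. The sketch below uses assumed rates $\alpha = 2$ and $\beta = 3$ and verifies the semigroup property $H(s)H(t) = H(s+t)$, the long-run limit, and the slope at $t = 0$, which should approximate the generator $Q$.

```python
import math

def H_closed(t, alpha, beta):
    """Closed-form transition matrix (4.25) for the two-state process."""
    c = alpha + beta
    e = math.exp(-c * t)
    return [
        [(alpha * e + beta) / c, alpha * (1 - e) / c],
        [beta * (1 - e) / c, (alpha + beta * e) / c],
    ]

def matmul2(a, b):
    """Product of two 2 x 2 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

alpha, beta = 2.0, 3.0  # assumed rates for illustration
semigroup_lhs = matmul2(H_closed(0.3, alpha, beta), H_closed(0.5, alpha, beta))
semigroup_rhs = H_closed(0.8, alpha, beta)
H_limit = H_closed(50.0, alpha, beta)  # rows approach (beta, alpha) / (alpha + beta)

# Finite-difference slope at t = 0, which should be close to Q = [[-alpha, alpha], [beta, -beta]].
h = 1e-6
Hh = H_closed(h, alpha, beta)
Q_est = [[(Hh[i][j] - (1.0 if i == j else 0.0)) / h for j in range(2)] for i in range(2)]
```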



4.10 Space-time structure of discrete-state Markov processes 

The previous section showed that the distribution of a time-homogeneous, discrete-state Markov
process can be specified by an initial probability distribution, and either a one-step transition
probability matrix $P$ (for discrete-time processes) or a generator matrix $Q$ (for continuous-time
processes). Another way to describe these processes is to specify the space-time structure, which is
simply the sequences of states visited and how long each state is visited. The space-time structure
is discussed first for discrete-time processes, and then for continuous-time processes. One benefit
is to show how little difference there is between discrete-time and continuous-time processes.

Let $(X_k : k \in \mathbb{Z}_+)$ be a time-homogeneous Markov process with one-step transition probability
matrix $P$. Let $T_k$ denote the time that elapses between the $k$th and $(k+1)$th jumps of $X$, and let
$X^J(k)$ denote the state after $k$ jumps. See Fig. 4.12 for illustration.

Figure 4.12: Illustration of jump process and holding times.

More precisely, the holding times are defined by

$$T_0 = \min\{t \geq 0 : X(t) \neq X(0)\} \tag{4.26}$$

$$T_k = \min\{t \geq 0 : X(T_0 + \cdots + T_{k-1} + t) \neq X(T_0 + \cdots + T_{k-1})\} \tag{4.27}$$

and the jump process $X^J = (X^J(k) : k \geq 0)$ is defined by

$$X^J(0) = X(0) \quad \text{and} \quad X^J(k) = X(T_0 + \cdots + T_{k-1}) \tag{4.28}$$

Clearly the holding times and jump process contain all the information needed to construct X, and 
vice versa. Thus, the following description of the joint distribution of the holding times and the 
jump process characterizes the distribution of X. 

Proposition 4.10.1 Let $X = (X(k) : k \in \mathbb{Z}_+)$ be a time-homogeneous Markov process with
one-step transition probability matrix $P$.

(a) The jump process $X^J$ is itself a time-homogeneous Markov process, and its one-step transition
probabilities are given by $p^J_{ij} = p_{ij}/(1 - p_{ii})$ for $i \neq j$, and $p^J_{ii} = 0$, $i, j \in \mathcal{S}$.

(b) Given $X(0)$, $X^J(1)$ is conditionally independent of $T_0$.

(c) Given $(X^J(0), \ldots, X^J(n)) = (j_0, \ldots, j_n)$, the variables $T_0, \ldots, T_n$ are conditionally
independent, and the conditional distribution of $T_l$ is geometric with parameter $p_{j_l j_l}$:

$$P(T_l = k \mid X^J(0) = j_0, \ldots, X^J(n) = j_n) = p_{j_l j_l}^{k-1}(1 - p_{j_l j_l}) \qquad 0 \leq l \leq n,\ k \geq 1.$$

Proof. Observe that if $X(0) = i$, then

$$\{T_0 = k, X^J(1) = j\} = \{X(1) = i, X(2) = i, \ldots, X(k-1) = i, X(k) = j\},$$

so

$$P(T_0 = k, X^J(1) = j \mid X(0) = i) = p_{ii}^{k-1}p_{ij} = \left[(1 - p_{ii})p_{ii}^{k-1}\right]p^J_{ij} \tag{4.29}$$

Because for $i$ fixed the last expression in (4.29) displays the product of two probability distributions,
conclude that given $X(0) = i$,

$T_0$ has distribution $((1 - p_{ii})p_{ii}^{k-1} : k \geq 1)$, the geometric distribution of mean $1/(1 - p_{ii})$

$X^J(1)$ has distribution $(p^J_{ij} : j \in \mathcal{S})$ ($i$ fixed)

$T_0$ and $X^J(1)$ are independent

More generally, check that, with $j_0 = i$,

$$P(X^J(1) = j_1, \ldots, X^J(n) = j_n, T_0 = k_0, \ldots, T_n = k_n \mid X^J(0) = i)
= p^J_{i j_1}p^J_{j_1 j_2} \cdots p^J_{j_{n-1} j_n} \prod_{l=0}^{n}\left(p_{j_l j_l}^{k_l - 1}(1 - p_{j_l j_l})\right)$$

This establishes the proposition. ∎
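
The factorization (4.29) can be verified exactly for a small chain. The three-state matrix `P` below is a hypothetical example with $p_{ii} < 1$ for every state; the function names are illustrative.

```python
from fractions import Fraction as F

def jump_matrix(P):
    """p^J_ij = p_ij / (1 - p_ii) for i != j and p^J_ii = 0 (requires p_ii < 1)."""
    n = len(P)
    return [[F(0) if i == j else P[i][j] / (1 - P[i][i]) for j in range(n)] for i in range(n)]

def geometric_pmf(p_stay, k):
    """P(T = k) = p_stay^(k-1) (1 - p_stay) for k >= 1."""
    return p_stay ** (k - 1) * (1 - p_stay)

# Hypothetical one-step transition matrix.
P = [
    [F(1, 2), F(1, 4), F(1, 4)],
    [F(1, 3), F(1, 3), F(1, 3)],
    [F(0),    F(1, 5), F(4, 5)],
]
PJ = jump_matrix(P)

# Check (4.29): p_ii^(k-1) p_ij = [(1 - p_ii) p_ii^(k-1)] p^J_ij for i != j and k = 1..5.
factorizes = all(
    P[i][i] ** (k - 1) * P[i][j] == geometric_pmf(P[i][i], k) * PJ[i][j]
    for i in range(3) for j in range(3) if i != j
    for k in range(1, 6)
)
mean_holding = [1 / (1 - P[i][i]) for i in range(3)]  # mean of the geometric holding time
```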

Next we consider the space-time structure of time-homogeneous continuous-time pure-jump
Markov processes. Essentially the only difference between the discrete- and continuous-time Markov
processes is that the holding times for the continuous-time processes are exponentially distributed
rather than geometrically distributed. Indeed, define the holding times $T_k$, $k \geq 0$, and the jump
process $X^J$ using (4.26)-(4.28) as before.

Proposition 4.10.2 Let $X = (X(t) : t \in \mathbb{R}_+)$ be a time-homogeneous, pure-jump Markov process
with generator matrix $Q$. Then

(a) The jump process $X^J$ is a discrete-time, time-homogeneous Markov process, and its one-step
transition probabilities are given by

$$p^J_{ij} = \begin{cases} -q_{ij}/q_{ii} & \text{for } i \neq j \\ 0 & \text{for } i = j \end{cases} \tag{4.30}$$

(b) Given $X(0)$, $X^J(1)$ is conditionally independent of $T_0$.

(c) Given $X^J(0) = j_0, \ldots, X^J(n) = j_n$, the variables $T_0, \ldots, T_n$ are conditionally independent,
and the conditional distribution of $T_l$ is exponential with parameter $-q_{j_l j_l}$:

$$P(T_l \geq c \mid X^J(0) = j_0, \ldots, X^J(n) = j_n) = \exp(cq_{j_l j_l}) \qquad 0 \leq l \leq n.$$

Figure 4.13: Illustration of sampling of a pure-jump function.

Proof. Fix $h > 0$ and define the "sampled" process $X^{(h)}$ by $X^{(h)}(k) = X(hk)$ for $k \geq 0$. See
Fig. 4.13. Then $X^{(h)}$ is a discrete-time Markov process with one-step transition probabilities $p_{ij}(h)$
(the transition probabilities for the original process for an interval of length $h$). Let $(T_k^{(h)} : k \geq 0)$
denote the sequence of holding times and $(X^{J,h}(k) : k \geq 0)$ the jump process for the process $X^{(h)}$.

The assumption that with probability one the sample paths of $X$ are pure-jump functions
implies that with probability one:

$$\lim_{h \to 0}\left(X^{J,h}(0), X^{J,h}(1), \ldots, X^{J,h}(n), hT_0^{(h)}, hT_1^{(h)}, \ldots, hT_n^{(h)}\right)
= \left(X^J(0), X^J(1), \ldots, X^J(n), T_0, T_1, \ldots, T_n\right) \tag{4.31}$$

Since convergence with probability one implies convergence in distribution, the goal of identifying
the distribution of the random vector on the righthand side of (4.31) can be accomplished by
identifying the limit of the distribution of the vector on the left.

First, the limiting distribution of the process $X^{J,h}$ is identified. Since $X^{(h)}$ has one-step
transition probabilities $p_{ij}(h)$, the formula for the jump process probabilities for discrete-time processes
(see Proposition 4.10.1, part a) yields that the one-step transition probabilities $p^{J,h}_{ij}$ for $X^{J,h}$ are
given by

$$p^{J,h}_{ij} = \frac{p_{ij}(h)}{1 - p_{ii}(h)} = \frac{p_{ij}(h)/h}{(1 - p_{ii}(h))/h} \to \frac{q_{ij}}{-q_{ii}} \quad \text{as } h \to 0 \tag{4.32}$$

for $i \neq j$, where the limit indicated in (4.32) follows from the definition (4.18) of the generator matrix
$Q$. Thus, the limiting distribution of $X^{J,h}$ is that of a Markov process with one-step transition
probabilities given by (4.30), establishing part (a) of the proposition. The conditional independence
properties stated in (b) and (c) of the proposition follow in the limit from the corresponding
properties for the jump process $X^{J,h}$ guaranteed by Proposition 4.10.1. Finally, since
$\log(1 + \theta) = \theta + o(\theta)$ by Taylor's formula, we have for all $c \geq 0$ that

$$\begin{aligned}
P(hT_l^{(h)} > c \mid X^{J,h}(0) = j_0, \ldots, X^{J,h}(n) = j_n)
&= \left(p_{j_l j_l}(h)\right)^{\lfloor c/h \rfloor} \\
&= \exp\left(\lfloor c/h \rfloor \log(p_{j_l j_l}(h))\right) \\
&= \exp\left(\lfloor c/h \rfloor (q_{j_l j_l}h + o(h))\right) \\
&\to \exp(q_{j_l j_l}c) \quad \text{as } h \to 0
\end{aligned}$$

which establishes the remaining part of (c), and the proposition is proved. ∎
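
Both limits used in the proof can be observed numerically. The sketch below computes the jump probabilities (4.30) for an assumed three-state generator and evaluates the geometric tail $(p_{ii}(h))^{\lfloor c/h \rfloor}$ with $p_{ii}(h)$ replaced by its linearization $1 + q_{ii}h$; dropping the $o(h)$ term is a simplifying assumption of this illustration.

```python
import math

def jump_matrix_from_generator(Q):
    """p^J_ij = -q_ij / q_ii for i != j and p^J_ii = 0, as in (4.30); requires q_ii < 0."""
    n = len(Q)
    return [[0.0 if i == j else -Q[i][j] / Q[i][i] for j in range(n)] for i in range(n)]

def sampled_holding_tail(q_ii, c, h):
    """(1 + q_ii h)^floor(c/h): tail of h T^(h) using p_ii(h) ~ 1 + q_ii h (o(h) dropped)."""
    return (1.0 + q_ii * h) ** math.floor(c / h)

# Assumed three-state generator (rows sum to zero, off-diagonal entries nonnegative).
Q = [[-1.0, 0.5, 0.5], [1.0, -2.0, 1.0], [0.0, 1.0, -1.0]]
PJ = jump_matrix_from_generator(Q)

c = 0.7
tail_coarse = sampled_holding_tail(Q[1][1], c, 1e-2)
tail_fine = sampled_holding_tail(Q[1][1], c, 1e-5)
tail_limit = math.exp(Q[1][1] * c)  # exponential tail exp(q_ii c)
```

As $h$ shrinks, the geometric tail of the sampled holding time approaches the exponential tail $\exp(q_{ii}c)$.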



4.11 Problems 

4.1 Event probabilities for a simple random process 

Define the random process $X$ by $X_t = 2A + Bt$ where $A$ and $B$ are independent random variables
with $P\{A = 1\} = P\{A = -1\} = P\{B = 1\} = P\{B = -1\} = 0.5$. (a) Sketch the possible sample
functions. (b) Find $P\{X_t \geq 0\}$ for all $t$. (c) Find $P\{X_t \geq 0 \text{ for all } t\}$.

4.2 Correlation function of a product 

Let $Y$ and $Z$ be independent random processes with $R_Y(s,t) = 2\exp(-|s-t|)\cos(2\pi f(s-t))$ and
$R_Z(s,t) = 9 + \exp(-3|s-t|^4)$. Find the autocorrelation function $R_X(s,t)$ where $X_t = Y_t Z_t$.

4.3 A sinusoidal random process 

Let $X_t = A\cos(2\pi V t + \Theta)$, where the amplitude $A$ has mean 2 and variance 4, the frequency $V$ in
Hertz is uniform on $[0, 5]$, and the phase $\Theta$ is uniform on $[0, 2\pi]$. Furthermore, suppose $A$, $V$ and $\Theta$
are independent. Find the mean function $\mu_X(t)$ and autocorrelation function $R_X(s,t)$. Is $X$ WSS?

4.4 Another sinusoidal random process 

Suppose that $X_1$ and $X_2$ are random variables such that $EX_1 = EX_2 = EX_1X_2 = 0$ and
$\mathrm{Var}(X_1) = \mathrm{Var}(X_2) = \sigma^2$. Define $Y_t = X_1\cos(2\pi t) - X_2\sin(2\pi t)$. (a) Is the random process $Y$
necessarily wide-sense stationary? (b) Give an example of random variables $X_1$ and $X_2$ satisfying the
given conditions such that $Y$ is stationary. (c) Give an example of random variables $X_1$ and $X_2$
satisfying the given conditions such that $Y$ is not (strict sense) stationary.

4.5 A random line 

Let $X = (X_t : t \in \mathbb{R})$ be a random process such that $X_t = R - St$ for all $t$, where $R$ and $S$ are
independent random variables, having the Rayleigh distribution with positive parameters $\sigma_R^2$ and
$\sigma_S^2$, respectively.

(a) Indicate three typical sample paths of X in a single sketch. Describe in words the set of possible 
sample paths of X. 

(b) Is X a Markov process? Why or why not? 

(c) Does X have independent increments? Why or why not? 

(d) Let $A$ denote the area of the triangle bounded by portions of the coordinate axes and the graph
of $X$. Find $E[A]$. Simplify your answer as much as possible.

4.6 Brownian motion: Ascension and smoothing

Let $W$ be a Brownian motion process and suppose $0 < r < s < t$.

(a) Find $P\{W_r < W_s < W_t\}$.

(b) Find $E[W_s \mid W_r, W_t]$. (This part is unrelated to part (a).)

4.7 Brownian bridge 

Let $W = (W_t : t \geq 0)$ be a standard Brownian motion (i.e. a Brownian motion with parameter
$\sigma^2 = 1$). Let $B_t = W_t - tW_1$ for $0 \leq t \leq 1$. The process $B = (B_t : 0 \leq t \leq 1)$ is called a Brownian
bridge process. Like $W$, $B$ is a mean zero Gaussian random process.

(a) Sketch a typical sample path of $W$, and the corresponding sample path of $B$.

(b) Find the autocorrelation function of $B$.

(c) Is $B$ a Markov process?

(d) Show that $B$ is independent of the random variable $W_1$. (This means that for any finite
collection, $t_1, \ldots, t_n \in [0, 1]$, the random vector $(B_{t_1}, \ldots, B_{t_n})^T$ is independent of $W_1$.)

(e) (Due to J.L. Doob.) Let $X_t = (1 - t)W_{t/(1-t)}$ for $0 \leq t < 1$ and let $X_1 = 0$. Let $X$ denote the
random process $X = (X_t : 0 \leq t \leq 1)$. Like $W$, $X$ is a mean zero, Gaussian random process. Find
the autocorrelation function of $X$. Can you draw any conclusions?

4.8 Some Poisson process calculations 

Let $N = (N_t : t \geq 0)$ be a Poisson process with rate $\lambda > 0$.

(a) Give a simple expression for $P(N_1 \geq 1 \mid N_2 = 2)$ in terms of $\lambda$.

(b) Give a simple expression for $P(N_2 = 2 \mid N_1 \geq 1)$ in terms of $\lambda$.

(c) Let $X_t = N_t^2$. Is $X = (X_t : t \geq 0)$ a time-homogeneous Markov process? If so, give the
transition probabilities $p_{ij}(\tau)$. If not, explain.

4.9 A random process corresponding to a random parabola 

Define a random process $X$ by $X_t = A + Bt + t^2$, where $A$ and $B$ are independent, $N(0, 1)$ random
variables. (a) Find $\hat{E}[X_5 \mid X_1]$, the linear minimum mean square error (LMMSE) estimator of $X_5$
given $X_1$, and compute the mean square error. (b) Find the MMSE (possibly nonlinear) estimator
of $X_5$ given $X_1$, and compute the mean square error. (c) Find $E[X_5 \mid X_0, X_1]$ and compute the mean
square error. (Hint: Can do by inspection.)

4.10 MMSE prediction for a Gaussian process based on two observations 

Let $X$ be a stationary Gaussian process with mean zero and $R_X(\tau) = 5\cos\left(\frac{\pi\tau}{2}\right)3^{-|\tau|}$. (a) Find the
covariance matrix of the random vector $(X(2), X(3), X(4))^T$. (b) Find $E[X(4) \mid X(2)]$. (c) Find
$E[X(4) \mid X(2), X(3)]$.

4.11 A simple discrete-time random process 

Let $U = (U_n : n \in \mathbb{Z})$ consist of independent random variables, each uniformly distributed on the
interval $[0,1]$. Let $X = (X_k : k \in \mathbb{Z})$ be defined by $X_k = \max\{U_{k-1}, U_k\}$. (a) Sketch a typical
sample path of the process $X$. (b) Is $X$ stationary? (c) Is $X$ Markov? (d) Describe the first order
distributions of $X$. (e) Describe the second order distributions of $X$.

4.12 Poisson process probabilities

Consider a Poisson process with rate $\lambda > 0$.

(a) Find the probability that there is (exactly) one count in each of the three intervals [0,1], [1,2], 
and [2,3]. 

(b) Find the probability that there are two counts in the interval [0, 2] and two counts in the interval 
[1,3]. (Note: your answer to part (b) should be larger than your answer to part (a)). 

(c) Find the probability that there are two counts in the interval [1,2], given that there are two
counts in the interval [0,2] and two counts in the interval [1,3].

4.13 Sliding function of an i.i.d. Poisson sequence 

Let $X = (X_k : k \in \mathbb{Z})$ be a random process such that the $X_i$ are independent, Poisson random
variables with mean $\lambda$, for some $\lambda > 0$. Let $Y = (Y_k : k \in \mathbb{Z})$ be the random process defined by
$Y_k = X_k + X_{k+1}$.

(a) Show that $Y_k$ is a Poisson random variable with parameter $2\lambda$ for each $k$.

(b) Show that X is a stationary random process. 

(c) Is Y a stationary random process? Justify your answer. 

4.14 Adding jointly stationary Gaussian processes 

Let $X$ and $Y$ be jointly stationary, jointly Gaussian random processes with mean zero, autocorrelation
functions $R_X(\tau) = R_Y(\tau) = \exp(-|\tau|)$, and cross-correlation function
$R_{XY}(\tau) = (0.5)\exp(-|\tau - 3|)$.

(a) Let Z(t) = (X(t) + Y(t))/2 for all t. Find the autocorrelation function of Z. 

(b) Is Z a stationary random process? Explain. 

(c) Find $P\{X(1) < 5Y(2) + 1\}$. You may express your answer in terms of the standard normal
cumulative distribution function $\Phi$.

4.15 Invariance of properties under transformations 

Let $X = (X_n : n \in \mathbb{Z})$, $Y = (Y_n : n \in \mathbb{Z})$, and $Z = (Z_n : n \in \mathbb{Z})$ be random processes such that
$Y_n = X_n^2$ for all $n$ and $Z_n = X_n^3$ for all $n$. Determine whether each of the following statements is
always true. If true, give a justification. If not, give a simple counterexample.

(a) If X is Markov then Y is Markov. 

(b) If X is Markov then Z is Markov. 

(c) If Y is Markov then X is Markov. 

(d) If X is stationary then Y is stationary. 

(e) If Y is stationary then X is stationary. 

(f) If X is wide sense stationary then Y is wide sense stationary.

(g) If X has independent increments then Y has independent increments.

(h) If X is a martingale then Z is a martingale.

4.16 A linear evolution equation with random coefficients 

Let the variables $A_k, B_k$, $k \geq 0$, be mutually independent with mean zero. Let $A_k$ have variance $\sigma_A^2$
and let $B_k$ have variance $\sigma_B^2$ for all $k$. Define a discrete-time random process $Y$ by
$Y = (Y_k : k \geq 0)$, such that $Y_0 = 0$ and $Y_{k+1} = A_k Y_k + B_k$ for $k \geq 0$.

(a) Find a recursive method for computing $P_k = E[(Y_k)^2]$ for $k \geq 0$.

(b) Is Y a Markov process? Explain. 

(c) Does Y have independent increments? Explain. 

(d) Find the autocorrelation function of $Y$. (You can use the second moments $(P_k)$ in expressing
your answer.)

(e) Find the corresponding linear innovations sequence $(\tilde{Y}_k : k \geq 1)$.

4.17 On an M/D/infinity system 

Suppose customers enter a service system according to a Poisson point process on $\mathbb{R}$ of rate $\lambda$,
meaning that the number of arrivals, $N(a,b]$, in an interval $(a, b]$, has the Poisson distribution
with mean $\lambda(b - a)$, and the numbers of arrivals in disjoint intervals are independent. Suppose
each customer stays in the system for one unit of time, independently of other customers. Because
the arrival process is memoryless, because the service times are deterministic, and because the
customers are served simultaneously, corresponding to infinitely many servers, this queueing system
is called an M/D/$\infty$ queueing system. The number of customers in the system at time $t$ is given
by $X_t = N(t-1, t]$.

(a) Find the mean and autocovariance function of $X$.

(b) Is X stationary? Is X wide sense stationary? 

(c) Is X a Markov process? 

(d) Find a simple expression for $P\{X_t = 0 \text{ for } t \in [0, 1]\}$ in terms of $\lambda$.

(e) Find a simple expression for $P\{X_t > 0 \text{ for } t \in [0, 1]\}$ in terms of $\lambda$.

4.18 A fly on a cube 

Consider a cube with vertices 000, 001, 010, 100, 110, 101, 011, 111. Suppose a fly walks along
edges of the cube from vertex to vertex, and for any integer $t \geq 0$, let $X_t$ denote which vertex the
fly is at at time $t$. Assume $X = (X_t : t \geq 0)$ is a discrete-time Markov process, such that given $X_t$,
the next state $X_{t+1}$ is equally likely to be any one of the three vertices neighboring $X_t$.

(a) Sketch the one-step transition probability diagram for $X$.

(b) Let $Y_t$ denote the distance, measured in number of hops, between vertex 000 and $X_t$. For
example, if $X_t = 101$, then $Y_t = 2$. The process $Y$ is a Markov process with states 0, 1, 2, and 3.
Sketch the one-step transition probability diagram for $Y$.

(c) Suppose the fly begins at vertex 000 at time zero. Let $\tau$ be the first time that $X$ returns to
vertex 000 after time 0, or equivalently, the first time that $Y$ returns to 0 after time 0. Find $E[\tau]$.

4.19 Time elapsed since Bernoulli renewals 

Let $U = (U_k : k \in \mathbb{Z})$ be such that for some $p \in (0, 1)$, the random variables $U_k$ are independent,
with each having the Bernoulli distribution with parameter $p$. Interpret $U_k = 1$ to mean that a
renewal, or replacement, of some part takes place at time $k$. For $k \in \mathbb{Z}$, let
$X_k = \min\{i \geq 1 : U_{k-i} = 1\}$. In words, $X_k$ is the time elapsed since the last renewal strictly before
time $k$.

(a) The process $X$ is a time-homogeneous Markov process. Indicate a suitable state space, and
describe the one-step transition probabilities.

(b) Find the distribution of $X_k$ for $k$ fixed.

(c) Is $X$ a stationary random process? Explain.

(d) Find the $k$-step transition probabilities, $p_{ij}(k) = P\{X_{n+k} = j \mid X_n = i\}$.

4.20 A random process created by interpolation 

Let $U = (U_k : k \in \mathbb{Z})$ be such that the $U_k$ are independent, and each is uniformly distributed on
the interval $[0, 1]$. Let $X = (X_t : t \in \mathbb{R})$ denote the continuous-time random process obtained by
linearly interpolating between the $U$'s. Specifically, $X_n = U_n$ for any $n \in \mathbb{Z}$, and $X_t$ is affine on
each interval of the form $[n, n + 1]$ for $n \in \mathbb{Z}$.

(a) Sketch a sample path of $U$ and a corresponding sample path of $X$.

(b) Let $t \in \mathbb{R}$. Find and sketch the first order marginal density, $f_{X,1}(x,t)$. (Hint: Let $n = \lfloor t \rfloor$ and
$a = t - n$, so that $t = n + a$. Then $X_t = (1 - a)U_n + aU_{n+1}$. It's helpful to consider the cases
$0 \leq a \leq 0.5$ and $0.5 < a < 1$ separately. For brevity, you need only consider the case $0 \leq a \leq 0.5$.)

(c) Is the random process $X$ WSS? Justify your answer.

(d) Find $P\{\max_{0 \leq t \leq 10} X_t \leq 0.5\}$.

4.21 Reinforcing samples 

(Due to G. Polya) Suppose at time $k = 2$, there is a bag with two balls in it, one orange and one
blue. During each time step between $k$ and $k + 1$, one of the balls is selected from the bag at
random, with all balls in the bag having equal probability. That ball, and a new ball of the same
color, are both put into the bag. Thus, at time $k$ there are $k$ balls in the bag, for all $k \geq 2$. Let $X_k$
denote the number of blue balls in the bag at time $k$.

(a) Is $X = (X_k : k \geq 2)$ a Markov process?

(b) Let $M_k = \frac{X_k}{k}$. Thus, $M_k$ is the fraction of balls in the bag at time $k$ that are blue. Determine
whether $M = (M_k : k \geq 2)$ is a martingale.

(c) By the theory of martingales, since $M$ is a bounded martingale, it converges a.s. to some
random variable $M_\infty$. Let $V_k = M_k(1 - M_k)$. Show that $E[V_{k+1} \mid V_k] = \frac{k(k+2)}{(k+1)^2} V_k$, and therefore that
$E[V_k] = \frac{k+1}{6k}$. It follows that $\mathrm{Var}(\lim_{k \to \infty} M_k) = \frac{1}{12}$.

(d) More concretely, find the distribution of $M_k$ for each $k$, and then identify the distribution of
the limit random variable, $M_\infty$.
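
As a numerical companion to this problem, the urn can be simulated directly. The sketch below is only a sanity check, not a solution: the function name, the horizon of 200 steps, the number of runs, and the seed are all arbitrary choices made for illustration. The empirical mean of $M_k$ should stay near $M_2 = 1/2$, and the empirical variance can be compared with the limiting variance asserted in part (c).

```python
import random

def polya_fraction_blue(k_final: int, rng: random.Random) -> float:
    """Simulate the urn from time 2 up to time k_final and return M_k = X_k / k."""
    blue, total = 1, 2                      # one blue and one orange ball at time 2
    while total < k_final:
        if rng.random() < blue / total:     # a blue ball was selected
            blue += 1                       # so a new blue ball is added
        total += 1                          # exactly one ball is added per step
    return blue / total

rng = random.Random(0)                      # fixed seed, chosen arbitrarily
samples = [polya_fraction_blue(200, rng) for _ in range(5000)]
mean = sum(samples) / len(samples)
var = sum((m - mean) ** 2 for m in samples) / len(samples)
print(mean, var)                            # empirical mean and variance of M_200
```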



4.22 Restoring samples 

Suppose at time $k = 2$, there is a bag with two balls in it, one orange and one blue. During each
time step between $k$ and $k + 1$, one of the balls is selected from the bag at random, with all balls
in the bag having equal probability. That ball, and a new ball of the other color, are both put into
the bag. Thus, at time $k$ there are $k$ balls in the bag, for all $k \geq 2$. Let $X_k$ denote the number of
blue balls in the bag at time $k$.

(a) Is $X = (X_k : k \geq 2)$ a Markov process? If so, describe the one-step transition probabilities.

(b) Compute $E[X_{k+1} \mid X_k]$ for $k \geq 2$.

(c) Let $M_k = \frac{X_k}{k}$. Thus, $M_k$ is the fraction of balls in the bag at time $k$ that are blue. Determine
whether $M = (M_k : k \geq 2)$ is a martingale.

(d) Let $D_k = M_k - \frac{1}{2}$. Show that

$$E[D_{k+1}^2 \mid X_k] = \frac{1}{(k+1)^2}\left\{k(k-2)D_k^2 + \frac{1}{4}\right\}.$$

(e) Let $v_k = E[D_k^2]$. Prove by induction on $k$ that $v_k \leq \frac{1}{4k}$. What can you conclude about the limit
of $M_k$ as $k \to \infty$? (Be sure to specify what sense(s) of limit you mean.)

4.23 A space-time transformation of Brownian motion 

Suppose $X = (X_t : t \geq 0)$ is a real-valued, mean zero, independent increment process, and let
$E[X_t^2] = \rho_t$ for $t \geq 0$. Assume $\rho_t < \infty$ for all $t$.

(a) Show that $\rho$ must be nonnegative and nondecreasing over $[0, \infty)$.

(b) Express the autocorrelation function $R_X(s, t)$ in terms of the function $\rho$ for all $s \geq 0$ and $t \geq 0$.

(c) Conversely, suppose a nonnegative, nondecreasing function $\rho$ on $[0, \infty)$ is given. Let $Y_t = W(\rho_t)$
for $t \geq 0$, where $W$ is a standard Brownian motion with $R_W(s,t) = \min\{s,t\}$. Explain why $Y$ is
an independent increment process with $E[Y_t^2] = \rho_t$ for all $t \geq 0$.

(d) Define a process $Z$ in terms of a standard Brownian motion $W$ by $Z_0 = 0$ and $Z_t = tW(\frac{1}{t})$ for
$t > 0$. Does $Z$ have independent increments? Justify your answer.

4.24 An M/M/l/B queueing system 

Suppose $X$ is a continuous-time Markov process with the transition rate diagram shown, for a
positive integer $B$ and positive constant $\lambda$.

[Figure: transition rate diagram for $X$; the diagram is not recoverable from this text.]

(a) Find the generator matrix, $Q$, of $X$ for $B = 4$.

(b) Find the equilibrium probability distribution. (Note: The process X models the number of 
customers in a queueing system with a Poisson arrival process, exponential service times, one 
server, and a finite buffer.) 

4.25 Identification of special properties of two discrete-time processes 

Determine which of the properties: 

(i) Markov property 

(ii) martingale property 

(iii) independent increment property 
are possessed by the following two random processes. Justify your answers. 

(a) $X = (X_k : k \geq 0)$ defined recursively by $X_0 = 1$ and $X_{k+1} = (1 + X_k)U_k$ for $k \geq 0$, where
$U_0, U_1, \ldots$ are independent random variables, each uniformly distributed on the interval $[0, 1]$.

(b) $Y = (Y_k : k \geq 0)$ defined by $Y_0 = V_0$, $Y_1 = V_0 + V_1$, and $Y_k = V_{k-2} + V_{k-1} + V_k$ for $k \geq 2$, where
$V_k : k \in \mathbb{Z}$ are independent Gaussian random variables with mean zero and variance one.

4.26 Identification of special properties of two discrete-time processes (version 2) 

Determine which of the properties: 

(i) Markov property 

(ii) martingale property 

(iii) independent increment property 
are possessed by the following two random processes. Justify your answers.

(a) $(X_k : k \geq 0)$, where $X_k$ is the number of cells alive at time $k$ in a colony that evolves as
follows. Initially, there is one cell, so $X_0 = 1$. During each discrete time step, each cell either dies
or splits into two new cells, each possibility having probability one half. Suppose cells die or split
independently.

(b) $(Y_k : k \geq 0)$, such that $Y_0 = 1$ and, for $k \geq 1$, $Y_k = U_1 U_2 \cdots U_k$, where $U_1, U_2, \ldots$ are independent
random variables, each uniformly distributed over the interval $[0, 2]$.

4.27 Identification of special properties of two continuous-time processes 

Answer as in the previous problem, for the following two random processes: 

(a) $Z = (Z_t : t \geq 0)$, defined by $Z_t = \exp(W_t - \frac{\sigma^2 t}{2})$, where $W$ is a Brownian motion with parameter
$\sigma^2$. (Hint: Observe that $E[Z_t] = 1$ for all $t$.)

(b) $R = (R_t : t \geq 0)$ defined by $R_t = D_1 + D_2 + \cdots + D_{N_t}$, where $N$ is a Poisson process with rate
$\lambda > 0$ and $D_i : i \geq 1$ is an iid sequence of random variables, each having mean 0 and variance $\sigma^2$.

4.28 Identification of special properties of two continuous-time processes (version 2) 

Answer as in the previous problem, for the following two random processes: 

(a) $Z = (Z_t : t \geq 0)$, defined by $Z_t = W_t^2$, where $W$ is a Brownian motion with parameter $\sigma^2$.

(b) $R = (R_t : t \geq 0)$, defined by $R_t = \cos(2\pi t + \Theta)$, where $\Theta$ is uniformly distributed on the interval
$[0, 2\pi]$.

4.29 A branching process 

Let $p = (p_i : i \geq 0)$ be a probability distribution on the nonnegative integers with mean $m$.
Consider a population beginning with a single individual, comprising generation zero. The offspring
of the initial individual comprise the first generation, and, in general, the offspring of the $k$th
generation comprise the $(k+1)$st generation. Suppose the number of offspring of any individual has
the probability distribution $p$, independently of how many offspring other individuals have. Let
$Y_0 = 1$, and for $k \geq 1$ let $Y_k$ denote the number of individuals in the $k$th generation.

(a) Is $Y = (Y_k : k \geq 0)$ a Markov process? Briefly explain your answer.

(b) Find constants $c_k$ so that $\frac{Y_k}{c_k}$ is a martingale.

(c) Let $a_m = P\{Y_m = 0\}$, the probability of extinction by the $m$th generation. Express $a_{m+1}$
in terms of the distribution $p$ and $a_m$. (Hint: condition on the value of $Y_1$, and note that the $Y_1$
subpopulations beginning with the $Y_1$ individuals in generation one are independent and statistically
identical to the whole population.)

(d) Express the probability of eventual extinction, $a_\infty = \lim_{m \to \infty} a_m$, in terms of the distribution
$p$. Under what condition is $a_\infty = 1$?

(e) Find $a_\infty$ in terms of $\theta$ in case $p_k = \theta^k(1 - \theta)$ for $k \geq 0$ and $0 \leq \theta < 1$. (This distribution is
similar to the geometric distribution, and it has mean $m = \frac{\theta}{1-\theta}$.)

4.30 Moving balls 

Consider the motion of three indistinguishable balls on a linear array of positions, indexed by the
positive integers, such that one or more balls can occupy the same position. Suppose that at time
$t = 0$ there is one ball at position one, one ball at position two, and one ball at position three.
Given the positions of the balls at some integer time $t$, the positions at time $t + 1$ are determined
as follows. One of the balls in the leftmost occupied position is picked up, and one of the other
two balls is selected at random (but not moved), with each choice having probability one half. The
ball that was picked up is then placed one position to the right of the selected ball.

(a) Define a finite-state Markov process that tracks the relative positions of the balls. Try to 
use a small number of states. (Hint: Take the balls to be indistinguishable, and don't include 
the position numbers.) Describe the significance of each state, and give the one-step transition 
probability matrix for your process. 

(b) Find the equilibrium distribution of your process. 

(c) As time progresses, the balls all move to the right, and the average speed has a limiting value,
with probability one. Find that limiting value. (You can use the fact that for a finite-state Markov
process in which any state can eventually be reached from any other, the fraction of time the process
is in a state $i$ up to time $t$ converges a.s. to the equilibrium probability for state $i$ as $t \to \infty$.)

(d) Consider the following continuous time version of the problem. Given the current state at time 
t, a move as described above happens in the interval [t, t + h] with probability h + o(h). Give the 
generator matrix Q, find its equilibrium distribution, and identify the long term average speed of 
the balls. 

4.31 Mean hitting time for a discrete-time, discrete-state Markov process 

Let $(X_k : k \geq 0)$ be a time-homogeneous Markov process with the one-step transition probability
diagram shown.

[Figure: one-step transition probability diagram on states 1, 2, 3. The edge labels 0.4, 0.2, 0.8, 0.4
survive, but their placement is not recoverable from this text.]

(a) Write down the one-step transition probability matrix $P$.

(b) Find the equilibrium probability distribution $\pi$.

(c) Let $\tau = \min\{k \geq 0 : X_k = 3\}$ and let $a_i = E[\tau \mid X_0 = i]$ for $1 \leq i \leq 3$. Clearly $a_3 = 0$. Derive
equations for $a_1$ and $a_2$ by considering the possible values of $X_1$, in a way similar to the analysis
of the gambler's ruin problem. Solve the equations to find $a_1$ and $a_2$.

4.32 Mean hitting time for a continuous-time, discrete-space Markov process 

Let $(X_t : t \geq 0)$ be a time-homogeneous Markov process with the transition rate diagram shown.

[Figure: transition rate diagram on states 1, 2, 3. The edge labels 1, 1, 10, 5 survive, but their
placement is not recoverable from this text.]

(a) Write down the rate matrix $Q$.

(b) Find the equilibrium probability distribution $\pi$.

(c) Let $\tau = \min\{t \geq 0 : X_t = 3\}$ and let $a_i = E[\tau \mid X_0 = i]$ for $1 \leq i \leq 3$. Clearly $a_3 = 0$. Derive
equations for $a_1$ and $a_2$ by considering the possible values of $X_h$ for small values of $h > 0$ and
taking the limit as $h \to 0$. Solve the equations to find $a_1$ and $a_2$.

4.33 Poisson merger

Summing counting processes corresponds to "merging" point processes. Show that the sum of $K$
independent Poisson processes, having rates $\lambda_1, \ldots, \lambda_K$, respectively, is a Poisson process with rate
$\lambda_1 + \cdots + \lambda_K$. (Hint: First formulate and prove a similar result for sums of random variables,
and then think about what else is needed to get the result for Poisson processes. You can use the
definition of a Poisson process or one of the equivalent descriptions given by Proposition 4.5.2 in
the notes. Don't forget to check required independence properties.)

4.34 Poisson splitting 

Consider a stream of customers modeled by a Poisson process with rate $\lambda$, and suppose each customer
is one of $K$ types. Let $(p_1, \ldots, p_K)$ be a probability vector, and suppose that for each $k$, the $k$th customer
is type $i$ with probability $p_i$. The types of the customers are mutually independent and also
independent of the arrival times of the customers. Show that the stream of customers of a given
type $i$ is again a Poisson stream, and that its rate is $\lambda p_i$. (Same hint as in the previous problem
applies.) Show furthermore that the $K$ substreams are mutually independent.

4.35 Poisson method for coupon collector's problem 

(a) Suppose a stream of coupons arrives according to a Poisson process $(A(t) : t \geq 0)$ with rate
$\lambda = 1$, and suppose there are $k$ types of coupons. (In network applications, the coupons could be
pieces of a file to be distributed by some sort of gossip algorithm.) The type of each coupon in the
stream is randomly drawn from the $k$ types, each possibility having probability $\frac{1}{k}$, and the types of
different coupons are mutually independent. Let $p(k, t)$ be the probability that at least one coupon
of each type arrives by time $t$. (The letter "p" is used here because the number of coupons arriving
by time $t$ has the Poisson distribution.) Express $p(k,t)$ in terms of $k$ and $t$.

(b) Find $\lim_{k \to \infty} p(k, k\ln k + kc)$ for an arbitrary constant $c$. That is, find the limit of the probability
that the collection is complete at time $t = k\ln k + kc$. (Hint: If $a_k \to a$ as $k \to \infty$, then
$(1 + \frac{a_k}{k})^k \to e^a$.)

(c) The rest of this problem shows that the limit found in part (b) also holds if the total number of
coupons is deterministic, rather than Poisson distributed. One idea is that if $t$ is large, then $A(t)$
is not too far from its mean with high probability. Show, specifically, that

$$\lim_{k \to \infty} P\{A(k\ln k + kc) \geq k\ln k + kc'\} = \begin{cases} 0 & \text{if } c < c' \\ 1 & \text{if } c > c' \end{cases}$$

(d) Let $d(k, n)$ denote the probability that the collection is complete after $n$ coupon arrivals. (The
letter "d" is used here because the number of coupons, $n$, is deterministic.) Show that for any $k$, $t$,
and $n$ fixed, $d(k,n)P\{A(t) \geq n\} \leq p(k,t) \leq P\{A(t) \geq n\} + P\{A(t) \leq n\}d(k,n)$.

(e) Combine parts (c) and (d) to identify $\lim_{k \to \infty} d(k, k\ln k + kc)$.

4.36 Some orthogonal martingales based on Brownian motion 

(This problem is related to the problem on linear innovations and orthogonal polynomials in the
previous problem set.) Let $W = (W_t : t \geq 0)$ be a Brownian motion with $\sigma^2 = 1$ (called a standard
Brownian motion), and let $M_t = \exp(\theta W_t - \frac{\theta^2 t}{2})$ for an arbitrary constant $\theta$.

(a) Show that $(M_t : t \geq 0)$ is a martingale. (Hint for parts (a) and (b): For notational brevity, let
$\mathcal{W}_s$ represent $(W_u : 0 \leq u \leq s)$ for the purposes of conditioning. If $Z_t$ is a function of $W_t$ for each
$t$, then a sufficient condition for $Z$ to be a martingale is that $E[Z_t \mid \mathcal{W}_s] = Z_s$ whenever $0 < s < t$,
because then $E[Z_t \mid Z_u, 0 \leq u \leq s] = E[E[Z_t \mid \mathcal{W}_s] \mid Z_u, 0 \leq u \leq s] = E[Z_s \mid Z_u, 0 \leq u \leq s] = Z_s$.)

(b) By the power series expansion of the exponential function,

$$\exp\left(\theta W_t - \frac{\theta^2 t}{2}\right) = 1 + \theta W_t + \frac{\theta^2}{2}(W_t^2 - t) + \frac{\theta^3}{3!}(W_t^3 - 3tW_t) + \cdots = \sum_{n=0}^{\infty} \frac{\theta^n}{n!} M_n(t),$$

where $M_n(t) = t^{n/2} H_n\left(\frac{W_t}{\sqrt{t}}\right)$, and $H_n$ is the $n$th Hermite polynomial. The fact that $M$ is a martingale
for any value of $\theta$ can be used to show that $M_n$ is a martingale for each $n$ (you don't need to
supply details). Verify directly that $W_t^2 - t$ and $W_t^3 - 3tW_t$ are martingales.

(c) For fixed $t$, $(M_n(t) : n \geq 0)$ is a sequence of orthogonal random variables, because it is the linear
innovations sequence for the variables $1, W_t, W_t^2, \ldots$. Use this fact and the martingale property of
the $M_n$ processes to show that if $n \neq m$ and $s, t \geq 0$, then $M_n(s) \perp M_m(t)$.

4.37 A state space reduction preserving the Markov property 

Consider a time-homogeneous, discrete-time Markov process $X = (X_k : k \geq 0)$ with state space
$S = \{1, 2, 3\}$, initial state $X_0 = 3$, and one-step transition probability matrix

$$P = \begin{pmatrix} 0.0 & 0.8 & 0.2 \\ 0.1 & 0.6 & 0.3 \\ 0.2 & 0.8 & 0.0 \end{pmatrix}.$$

(a) Sketch the transition probability diagram and find the equilibrium probability distribution
$\pi = (\pi_1, \pi_2, \pi_3)$.

(b) Identify a function $f$ on $S$ so that $f(s) = a$ for two choices of $s$ and $f(s) = b$ for the third
choice of $s$, where $a \neq b$, such that the process $Y = (Y_k : k \geq 0)$ defined by $Y_k = f(X_k)$ is a Markov
process with only two states, and give the one-step transition probability matrix of $Y$. Briefly
explain your answer.

4.38 * Autocorrelation function of a stationary Markov process 

Let $X = (X_k : k \in \mathbb{Z})$ be a Markov process such that the state space, $\{\rho_1, \rho_2, \ldots, \rho_n\}$, is a finite
subset of the real numbers. Let $P = (p_{ij})$ denote the matrix of one-step transition probabilities.
Let $e$ be the column vector of all ones, and let $\pi(k)$ be the row vector
$\pi(k) = (P\{X_k = \rho_1\}, \ldots, P\{X_k = \rho_n\})$.

(a) Show that $Pe = e$ and $\pi(k + 1) = \pi(k)P$.

(b) Show that if the Markov chain $X$ is a stationary random process then $\pi(k) = \pi$ for all $k$, where
$\pi$ is a vector such that $\pi = \pi P$.

(c) Prove the converse of part (b).

(d) Show that $P(X_{k+m} = \rho_j \mid X_k = \rho_i, X_{k-1} = s_1, \ldots, X_{k-m} = s_m) = p_{ij}^{(m)}$, where $p_{ij}^{(m)}$ is the $i,j$th
element of the $m$th power of $P$, $P^m$, and $s_1, \ldots, s_m$ are arbitrary states.

(e) Assume that $X$ is stationary. Express $R_X(k)$ in terms of $P$, $(\rho_i)$, and the vector $\pi$ of parts (b)
and (c).



Chapter 5 

Inference for Markov Models 



This chapter gives a glimpse of the theory of iterative algorithms for graphical models, as well as an 
introduction to statistical estimation theory. It begins with a brief introduction to estimation the- 
ory: maximum likelihood and Bayes estimators are introduced, and an iterative algorithm, known 
as the expectation-maximization algorithm, for computation of maximum likelihood estimators in 
certain contexts, is described. This general background is then focused on three inference problems 
posed using Markov models. 

5.1 A bit of estimation theory 

The two most commonly used methods for producing estimates of unknown quantities are the 
maximum likelihood (ML) and Bayesian methods. These two methods are briefly described in this 
section, beginning with the ML method. 

Suppose a parameter $\theta$ is to be estimated, based on observation of a random variable $Y$. An
estimator of $\theta$ based on $Y$ is a function $\hat{\theta}$, which for each possible observed value $y$, gives the
estimate $\hat{\theta}(y)$. The ML method is based on the assumption that $Y$ has a pmf $p_Y(y|\theta)$ (if $Y$ is
discrete type) or a pdf $f_Y(y|\theta)$ (if $Y$ is continuous type), where $\theta$ is the unknown parameter to be
estimated, and the family of functions $p_Y(y|\theta)$ or $f_Y(y|\theta)$ is known.

Definition 5.1.1 For a particular value $y$ and parameter value $\theta$, the likelihood of $y$ for $\theta$ is
$p_Y(y|\theta)$, if $Y$ is discrete type, or $f_Y(y|\theta)$, if $Y$ is continuous type. The maximum likelihood estimate
of $\theta$ given $Y = y$ for a particular $y$ is the value of $\theta$ that maximizes the likelihood of $y$. That
is, the maximum likelihood estimator $\hat{\theta}_{ML}$ is given by $\hat{\theta}_{ML}(y) = \arg\max_\theta p_Y(y|\theta)$, or
$\hat{\theta}_{ML}(y) = \arg\max_\theta f_Y(y|\theta)$.

Note that the maximum likelihood estimator is not defined as one maximizing the likelihood
of the parameter $\theta$ to be estimated. In fact, $\theta$ need not even be a random variable. Rather, the
maximum likelihood estimator is defined by selecting the value of $\theta$ that maximizes the likelihood
of the observation.

Example 5.1.2 Suppose $Y$ is assumed to be a $N(\theta, \sigma^2)$ random variable, where $\sigma^2$ is known.
Equivalently, we can write $Y = \theta + W$, where $W$ is a $N(0, \sigma^2)$ random variable. Given that a value
$y$ is observed, the ML estimator is obtained by maximizing
$f_Y(y|\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-\theta)^2}{2\sigma^2}\right)$ with
respect to $\theta$. By inspection, $\hat{\theta}_{ML}(y) = y$.

Example 5.1.3 Suppose $Y$ is assumed to be a $Poi(\theta)$ random variable, for some $\theta > 0$. Given the
observation $Y = k$ for some fixed $k \geq 0$, the ML estimator is obtained by maximizing
$p_Y(k|\theta) = \frac{e^{-\theta}\theta^k}{k!}$ with respect to $\theta$. Equivalently, dropping the constant $k!$ and taking the logarithm, $\theta$ is to
be selected to maximize $-\theta + k\ln\theta$. The derivative is $-1 + k/\theta$, which is positive for $\theta < k$ and
negative for $\theta > k$. Hence, $\hat{\theta}_{ML}(k) = k$.
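
The conclusion of Example 5.1.3 can be checked with a brute-force search. The following is a minimal sketch, assuming a hypothetical search grid on $(0, 20]$ and the observation $k = 7$; the function names are illustrative and not part of the notes.

```python
import math

def poisson_log_likelihood(theta: float, k: int) -> float:
    """log p_Y(k | theta) for Y ~ Poi(theta), with theta > 0."""
    return -theta + k * math.log(theta) - math.lgamma(k + 1)

def ml_estimate_on_grid(k: int, grid):
    """Return the grid point that maximizes the likelihood of the observation k."""
    return max(grid, key=lambda theta: poisson_log_likelihood(theta, k))

# Hypothetical grid of candidate parameter values with spacing 0.01.
grid = [0.01 * i for i in range(1, 2001)]
theta_hat = ml_estimate_on_grid(7, grid)
print(theta_hat)   # should lie within one grid step of the observation k = 7
```

Because the log-likelihood $-\theta + k\ln\theta$ is increasing for $\theta < k$ and decreasing for $\theta > k$, the grid maximizer is the grid point closest to $k$.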

Note that in the ML method, the quantity to be estimated, $\theta$, is not assumed to be random.
This has the advantage that the modeler does not have to come up with a probability distribution
for $\theta$, and can still impose hard constraints on $\theta$. But the ML method does not permit incorporation
of soft probabilistic knowledge the modeler may have about $\theta$ before any observation is used.

The Bayesian method is based on estimating a random quantity. Thus, in the end, the variable 
to be estimated, say Z, and the observation, say Y, are jointly distributed random variables. 

Definition 5.1.4 The Bayes estimator of $Z$ given $Y$, for jointly distributed random variables $Z$
and $Y$, and cost function $C(z,y)$, is the function $\hat{Z} = g(Y)$ of $Y$ which minimizes the average cost,
$E[C(Z, \hat{Z})]$.

The assumed distribution of $Z$ is called the prior or a priori distribution, whereas the conditional
distribution of $Z$ given $Y$ is called the posterior or a posteriori distribution. In particular, if $Z$ is
discrete, there is a prior pmf, $p_Z$, and a posterior pmf, $p_{Z|Y}$, or if $Z$ and $Y$ are jointly continuous,
there is a prior pdf, $f_Z$, and a posterior pdf, $f_{Z|Y}$.

One of the most common choices of the cost function is the squared error, $C(z, \hat{z}) = (z - \hat{z})^2$, for
which the Bayes estimators are the minimum mean squared error (MMSE) estimators, examined
in Chapter 3. Recall that the MMSE estimators are given by the conditional expectation,
$g(y) = E[Z \mid Y = y]$, which, given the observation $Y = y$, is the mean of the posterior distribution of $Z$
given $Y = y$.

A commonly used choice of $C$ in case $Z$ is a discrete random variable is $C(z, \hat{z}) = I_{\{z \neq \hat{z}\}}$. In
this case, the Bayesian objective is to select $\hat{Z}$ to minimize $P\{Z \neq \hat{Z}\}$, or equivalently, to maximize
$P\{Z = \hat{Z}\}$. For an estimator $\hat{Z} = g(Y)$,

$$P\{Z = \hat{Z}\} = \sum_y P(Z = g(y) \mid Y = y)\, p_Y(y) = \sum_y p_{Z|Y}(g(y)|y)\, p_Y(y).$$

So a Bayes estimator for $C(z, \hat{z}) = I_{\{z \neq \hat{z}\}}$ is one such that $g(y)$ maximizes $P(Z = g(y) \mid Y = y)$ for
each $y$. That is, for each $y$, $g(y)$ is a maximizer of the posterior pmf of $Z$. The estimator, called
the maximum a posteriori probability (MAP) estimator, can be written concisely as

$$\hat{Z}_{MAP}(y) = \arg\max_z p_{Z|Y}(z|y).$$
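
To make the two Bayes estimators concrete, the following minimal sketch computes the posterior pmf, the MMSE estimate (posterior mean), and the MAP estimate (posterior maximizer) from a small joint pmf. The table of probabilities and the function names are hypothetical, chosen only for illustration.

```python
# Hypothetical joint pmf p_{Z,Y}(z, y) with Z in {0, 1, 2} and Y in {0, 1}.
joint = {
    (0, 0): 0.30, (1, 0): 0.10, (2, 0): 0.10,
    (0, 1): 0.05, (1, 1): 0.15, (2, 1): 0.30,
}

def posterior(y):
    """Return the posterior pmf p_{Z|Y}(. | y) as a dict z -> probability."""
    p_y = sum(p for (z, yy), p in joint.items() if yy == y)
    return {z: p / p_y for (z, yy), p in joint.items() if yy == y}

def mmse_estimate(y):
    """Bayes estimate for squared error cost: the posterior mean E[Z | Y = y]."""
    return sum(z * p for z, p in posterior(y).items())

def map_estimate(y):
    """Bayes estimate for the 0-1 cost: a maximizer of the posterior pmf."""
    post = posterior(y)
    return max(post, key=post.get)

for y in (0, 1):
    print(y, posterior(y), mmse_estimate(y), map_estimate(y))
```

Note that the MMSE estimate need not be a possible value of $Z$, whereas the MAP estimate always is.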
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Suppose there is a parameter to be estimated based on an observation Y, and suppose that 
the pmf of Y, py(y|#), is known for each 0. This is enough to determine the ML estimator, but 
determination of a Bayes estimator requires, in addition, a choice of cost function C and a prior 
probability distribution (i.e. a distribution for 0). For example, if 9 is a discrete variable, the 
Bayesian method would require that a prior pmf for 9 be selected. In that case, we can view the 
parameter to be estimated as a random variable, which we might denote by the upper case symbol 
O, and the prior pmf could be denoted by ps(9). Then, as required by the Bayesian method, the 
variable to be estimated, O, and the observation, Y, would be jointly distributed random variables. 
The joint pmf would be given by pe t y(0, Y) = Pq{0)py{v\0)- The posterior probability distribution 
can be expressed as a conditional pmf, by Bayes' formula: 

$$p_{\Theta|Y}(\theta|y) = \frac{p_\Theta(\theta)\, p_Y(y|\theta)}{p_Y(y)}, \tag{5.1}$$

where $p_Y(y) = \sum_{\theta'} p_{\Theta,Y}(\theta', y)$. Given y, the value of the MAP estimator is a value of $\theta$ that
maximizes $p_{\Theta|Y}(\theta|y)$ with respect to $\theta$. For that purpose, the denominator in the right-hand side
of (5.1) can be ignored, so that the MAP estimator is given by

$$\hat{\Theta}_{MAP}(y) = \arg\max_\theta\, p_{\Theta|Y}(\theta|y) = \arg\max_\theta\, p_\Theta(\theta)\, p_Y(y|\theta). \tag{5.2}$$

The expression, (5.2), for $\hat{\Theta}_{MAP}(y)$ is rather similar to the expression for the ML estimator,
$\hat{\theta}_{ML}(y) = \arg\max_\theta p_Y(y|\theta)$. In fact, the two estimators agree if the prior $p_\Theta(\theta)$ is uniform, meaning
it is the same for all $\theta$.
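The relationship between the ML estimator and the MAP estimator in (5.2) can be illustrated with a small numerical sketch. The setup below is hypothetical and not from the text: the parameter is a coin bias taking one of three values, the observation is the number of heads in n tosses, and the function names and prior values are chosen only for illustration.

```python
from math import comb


def binomial_pmf(y, n, theta):
    """p_Y(y | theta) for y heads in n tosses of a coin with bias theta."""
    return comb(n, y) * theta**y * (1 - theta)**(n - y)


def ml_estimate(y, n, thetas):
    """ML estimate: the candidate theta maximizing the likelihood p_Y(y | theta)."""
    return max(thetas, key=lambda th: binomial_pmf(y, n, th))


def map_estimate(y, n, prior):
    """MAP estimate as in (5.2): maximize prior(theta) * p_Y(y | theta).

    `prior` maps each candidate theta to its prior probability.
    """
    return max(prior, key=lambda th: prior[th] * binomial_pmf(y, n, th))


# Hypothetical numbers: a prior concentrated on a fair coin, and a uniform prior.
informative_prior = {0.2: 0.05, 0.5: 0.90, 0.8: 0.05}
uniform_prior = {th: 1 / 3 for th in informative_prior}

print(ml_estimate(7, 10, list(informative_prior)))
print(map_estimate(7, 10, informative_prior))
print(map_estimate(7, 10, uniform_prior))
```

With the uniform prior the prior factor is the same for every candidate, so the MAP and ML estimates coincide, whereas a strongly informative prior can pull the MAP estimate away from the ML estimate.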

The MAP criterion for selecting estimators can be extended to the case that Y and $\Theta$ are jointly
continuous variables, leading to the following:

$$\hat{\Theta}_{MAP}(y) = \arg\max_\theta\, f_{\Theta|Y}(\theta|y) = \arg\max_\theta\, f_\Theta(\theta)\, f_Y(y|\theta). \tag{5.3}$$

In this case, the probability that any estimator is exactly equal to $\Theta$ is zero, but taking $\hat{\Theta}_{MAP}(y)$
to maximize the posterior pdf maximizes the probability that the estimator is within $\epsilon$ of the true
value of $\theta$, in an asymptotic sense as $\epsilon \to 0$.

Example 5.1.5 Suppose Y is assumed to be a $N(\theta, \sigma^2)$ random variable, where the variance $\sigma^2$ is
known and $\theta$ is to be estimated. Using the Bayesian method, suppose the prior density of $\Theta$ is the
$N(0, b^2)$ density for some known parameter $b^2$. Equivalently, we can write $Y = \Theta + W$, where $\Theta$ is a
$N(0, b^2)$ random variable and W is a $N(0, \sigma^2)$ random variable, independent of $\Theta$. By the properties
of joint Gaussian densities given in Chapter 3, given Y = y, the posterior distribution (i.e. the
conditional distribution of $\Theta$ given y) is the normal distribution with mean
$E[\Theta \mid Y = y] = \frac{b^2}{b^2 + \sigma^2}\, y$ and variance $\frac{b^2 \sigma^2}{b^2 + \sigma^2}$. The mean and maximizing value of this
conditional density are both equal to $E[\Theta \mid Y = y]$. Therefore,
$\hat{\Theta}_{MMSE}(y) = \hat{\Theta}_{MAP}(y) = E[\Theta \mid Y = y]$. It is interesting to compare this
example to Example 5.1.2. The Bayes estimators (MMSE and MAP) are both smaller in magnitude
than $\hat{\theta}_{ML}(y) = y$, by the factor $\frac{b^2}{b^2 + \sigma^2}$. If $b^2$ is small compared to $\sigma^2$, the prior information indicates
that $|\theta|$ is believed to be small, resulting in the Bayes estimators being smaller in magnitude than
the ML estimator. As $b^2 \to \infty$, the prior distribution gets increasingly uniform, and the Bayes
estimators converge to the ML estimator.
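The shrinkage described in this example is easy to check numerically. The following is a minimal sketch, assuming the posterior mean $b^2 y/(b^2+\sigma^2)$ and posterior variance $b^2\sigma^2/(b^2+\sigma^2)$ for a $N(0, b^2)$ prior and $N(\theta, \sigma^2)$ observation; the function name and argument names are illustrative only.

```python
def gaussian_posterior(y, sigma2, b2):
    """Posterior mean and variance of Theta given Y = y.

    Assumes Y = Theta + W with Theta ~ N(0, b2) and W ~ N(0, sigma2) independent.
    The posterior mean is also the MMSE and the MAP estimate of Theta.
    """
    shrink = b2 / (b2 + sigma2)          # factor by which the ML estimate y is shrunk
    mean = shrink * y
    var = b2 * sigma2 / (b2 + sigma2)
    return mean, var


# A tight prior (small b2) shrinks strongly; a diffuse prior (large b2) barely shrinks.
for b2 in (0.1, 1.0, 1e6):
    print(b2, gaussian_posterior(2.0, 1.0, b2))
```

As `b2` grows, the returned mean approaches the ML estimate `y`, matching the limiting behavior noted in the example.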



Example 5.1.6 Suppose Y is assumed to be a $Poi(\theta)$ random variable. Using the Bayesian
method, suppose the prior distribution for $\Theta$ is the uniform distribution over the interval $[0, \theta_{max}]$,
for some known value $\theta_{max}$. Given the observation Y = k for some fixed $k \geq 0$, the MAP estimator
is obtained by maximizing

$$p_Y(k|\theta)\, f_\Theta(\theta) = \frac{e^{-\theta}\theta^k}{k!} \cdot \frac{I_{\{0 \leq \theta \leq \theta_{max}\}}}{\theta_{max}}$$

with respect to $\theta$. As seen in Example 5.1.3, the term $\frac{e^{-\theta}\theta^k}{k!}$ is increasing in $\theta$ for $\theta < k$ and
decreasing in $\theta$ for $\theta > k$. Therefore,

$$\hat{\Theta}_{MAP}(k) = \min\{k, \theta_{max}\}.$$

It is interesting to compare this example to Example 5.1.3. Intuitively, the prior probability
distribution indicates knowledge that $\theta \leq \theta_{max}$, but no more than that, because the prior restricted to
$\theta \leq \theta_{max}$ is uniform. If $\theta_{max}$ is less than k, the MAP estimator is strictly smaller than $\hat{\theta}_{ML}(k) = k$.
As $\theta_{max} \to \infty$, the MAP estimator converges to the ML estimator. Actually, deterministic prior
knowledge, such as $\theta \leq \theta_{max}$, can also be incorporated into ML estimation as a hard constraint.
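As a sanity check on the closed form $\min\{k, \theta_{max}\}$, the sketch below compares it with a brute-force maximization of $p_Y(k|\theta) f_\Theta(\theta)$ over a grid on $[0, \theta_{max}]$. The grid search, its resolution, and the function names are assumptions made only for illustration.

```python
from math import exp, factorial


def poisson_map_uniform_prior(k, theta_max):
    """Closed-form MAP estimate for a Poi(theta) observation k, uniform prior on [0, theta_max]."""
    return min(k, theta_max)


def poisson_map_grid(k, theta_max, steps=10000):
    """Brute-force check: maximize p_Y(k | theta) * f_Theta(theta) over a grid on [0, theta_max]."""
    grid = [theta_max * m / steps for m in range(steps + 1)]

    def posterior_numerator(theta):
        return exp(-theta) * theta**k / factorial(k) / theta_max

    return max(grid, key=posterior_numerator)


print(poisson_map_uniform_prior(3, 5.0), poisson_map_grid(3, 5.0))
print(poisson_map_uniform_prior(7, 5.0), poisson_map_grid(7, 5.0))
```

In the second call the observation exceeds $\theta_{max}$, so the constraint is active and both methods return the right endpoint of the interval.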

The next example makes use of the following lemma. 

Lemma 5.1.7 Suppose $c_i \geq 0$ for $1 \leq i \leq n$ and that $c = \sum_{i=1}^n c_i > 0$. Then $\sum_{i=1}^n c_i \log p_i$ is
maximized over all probability vectors $p = (p_1, \ldots, p_n)$ by $p_i = c_i/c$.

Proof. If $c_j = 0$ for some j, then clearly $p_j = 0$ for the maximizing probability vector. By
eliminating such terms from the sum, we can assume without loss of generality that $c_i > 0$ for
all i. The function to be maximized is a strictly concave function of p over a region with linear
constraints. The positivity constraints, namely $p_i \geq 0$, will be satisfied with strict inequality.
The remaining constraint is the equality constraint, $\sum_{i=1}^n p_i = 1$. We thus introduce a Lagrange
multiplier $\lambda$ for the equality constraint and seek the stationary point of the Lagrangian
$L(p, \lambda) = \sum_{i=1}^n c_i \log p_i - \lambda\left(\left(\sum_{i=1}^n p_i\right) - 1\right)$. By definition, the stationary point is the point at which the partial
derivatives with respect to the variables $p_i$ are all zero. Setting $\frac{\partial L}{\partial p_i} = \frac{c_i}{p_i} - \lambda = 0$ yields that $p_i = \frac{c_i}{\lambda}$
for all i. To satisfy the linear constraint, $\lambda$ must equal c. ∎



Example 5.1.8 Suppose $b = (b_1, b_2, \ldots, b_n)$ is a probability vector to be estimated by observing
$Y = (Y_1, \ldots, Y_T)$. Assume $Y_1, \ldots, Y_T$ are independent, with each $Y_t$ having probability distribution
b: $P\{Y_t = i\} = b_i$ for $1 \leq t \leq T$ and $1 \leq i \leq n$. We shall determine the maximum likelihood
estimate, $\hat{b}_{ML}(y)$, given a particular observation $y = (y_1, \ldots, y_T)$. The likelihood to be maximized
with respect to b is $p(y|b) = b_{y_1} \cdots b_{y_T} = \prod_{i=1}^n b_i^{k_i}$, where $k_i = |\{t : y_t = i\}|$. The log likelihood is
$\ln p(y|b) = \sum_{i=1}^n k_i \ln(b_i)$. By Lemma 5.1.7, this is maximized by the empirical distribution of the
observations, namely $b_i = \frac{k_i}{T}$ for $1 \leq i \leq n$. That is, $\hat{b}_{ML} = \left(\frac{k_1}{T}, \ldots, \frac{k_n}{T}\right)$.



Example 5.1.9 This is a Bayesian version of the previous example. Suppose $b = (b_1, b_2, \ldots, b_n)$
is a probability vector to be estimated by observing $Y = (Y_1, \ldots, Y_T)$, and assume $Y_1, \ldots, Y_T$ are
independent, with each $Y_t$ having probability distribution b. For the Bayesian method, a distribution
of the unknown distribution b must be assumed. That is right, a distribution of the distribution
is needed. A convenient choice is the following. Suppose for some known numbers $d_i \geq 1$ that
$(b_1, \ldots, b_{n-1})$ has the prior density:

$$f_B(b) = \begin{cases} \dfrac{\prod_{i=1}^n b_i^{d_i - 1}}{Z(d)} & \text{if } b_i \geq 0 \text{ for } 1 \leq i \leq n-1, \text{ and } \sum_{i=1}^{n-1} b_i \leq 1 \\ 0 & \text{else} \end{cases}$$

where $b_n = 1 - b_1 - \cdots - b_{n-1}$, and $Z(d)$ is a constant chosen so that $f_B$ integrates to one. A larger
value of $d_i$ for a fixed i expresses an a priori guess that the corresponding value $b_i$ may be larger. It
can be shown, in particular, that if B has this prior distribution, then $E[B_i] = \frac{d_i}{d_1 + \cdots + d_n}$. The MAP
estimate, $\hat{b}_{MAP}(y)$, for a given observation vector y, is given by:

$$\hat{b}_{MAP}(y) = \arg\max_b \ln\left(f_B(b)\, p(y|b)\right) = \arg\max_b \left\{ -\ln(Z(d)) + \sum_{i=1}^n (d_i - 1 + k_i) \ln(b_i) \right\}.$$

By Lemma 5.1.7, $\hat{b}_{MAP}(y) = \left(\frac{d_1 - 1 + k_1}{\tilde{T}}, \ldots, \frac{d_n - 1 + k_n}{\tilde{T}}\right)$, where $\tilde{T} = \sum_{i=1}^n (d_i - 1 + k_i) = T - n + \sum_{i=1}^n d_i$.
Comparison with Example 5.1.8 shows that the MAP estimate is the same as the ML estimate,
except that $d_i - 1$ is added to $k_i$ for each i. If the $d_i$'s are integers, the MAP estimate is the ML
estimate with some prior observations mixed in, namely, $d_i - 1$ prior observations of outcome i for
each i. A prior distribution such that the MAP estimate has the same algebraic form as the ML
estimate is called a conjugate prior, and the specific density $f_B$ for this example is called the
Dirichlet density with parameter vector d.
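The ML estimate of Example 5.1.8 and the MAP estimate of Example 5.1.9 differ only in the pseudo-counts $d_i - 1$ added to the observed counts $k_i$. The sketch below computes both; the function names and the sample data are illustrative assumptions, and outcomes are labeled 1 through n as in the examples.

```python
from collections import Counter


def ml_estimate(y, n):
    """Empirical distribution: b_i = k_i / T, where k_i counts occurrences of outcome i in y."""
    counts = Counter(y)
    T = len(y)
    return [counts.get(i, 0) / T for i in range(1, n + 1)]


def dirichlet_map_estimate(y, d):
    """MAP estimate under a Dirichlet prior with parameter vector d (each d_i >= 1).

    Each weight is d_i - 1 + k_i; the normalizer equals T - n + sum(d).
    Assumes the normalizer is strictly positive.
    """
    n = len(d)
    counts = Counter(y)
    weights = [d[i - 1] - 1 + counts.get(i, 0) for i in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


y = [1, 1, 2, 3, 1]                           # hypothetical observations, outcomes in {1, 2, 3}
print(ml_estimate(y, 3))
print(dirichlet_map_estimate(y, [2, 2, 2]))   # one prior pseudo-observation per outcome
print(dirichlet_map_estimate(y, [1, 1, 1]))   # uniform prior: same as the ML estimate
```

With `d = [1, 1, 1]` the prior density is constant on the simplex, so no pseudo-counts are added and the two estimates agree.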



Example 5.1.10 Suppose that $Y = (Y_1, \ldots, Y_T)$ is observed, and that it is assumed that the $Y_t$
are independent, with the binomial distribution with parameters n and q. Suppose n is known, and
q is an unknown parameter to be estimated from Y. Let us find the maximum likelihood estimate,
$\hat{q}_{ML}(y)$, for a particular observation $y = (y_1, \ldots, y_T)$. The likelihood is

$$p(y|q) = \prod_{t=1}^T \left[ \binom{n}{y_t} q^{y_t} (1-q)^{n - y_t} \right] = c\, q^s (1-q)^{nT - s},$$

where $s = y_1 + \cdots + y_T$, and c depends on y but not on q. The log likelihood is
$\ln c + s \ln(q) + (nT - s) \ln(1 - q)$. Maximizing over q yields $\hat{q}_{ML} = \frac{s}{nT}$. An alternative way to think about this is
to realize that each $Y_t$ can be viewed as the sum of n independent Bernoulli(q) random variables,
and s can be viewed as the observed sum of nT independent Bernoulli(q) random variables.

5.2 The expectation-maximization (EM) algorithm 

The expectation-maximization algorithm is a computational method for computing maximum like- 
lihood estimates in contexts where there are hidden random variables, in addition to observed data 
and unknown parameters. The following notation will be used. 

$\theta$, a parameter to be estimated

X, the complete data

$p_{cd}(x|\theta)$, the pmf of the complete data, which is a known function for each value of $\theta$

Y = h(X), the observed random vector

Z, the unobserved data (This notation is used in the common case that X has the form X =
(Y, Z).)

We write $p(y|\theta)$ to denote the pmf of Y for a given value of $\theta$. It can be expressed in terms of the
pmf of the complete data by:

$$p(y|\theta) = \sum_{\{x : h(x) = y\}} p_{cd}(x|\theta) \tag{5.4}$$

In some applications, there can be a very large number of terms in the sum in (5.4), making it
difficult to numerically maximize $p(y|\theta)$ with respect to $\theta$ (i.e. to compute $\hat{\theta}_{ML}(y)$).

Algorithm 5.2.1 (Expectation-maximization (EM) algorithm) An observation y is given, along
with an initial estimate $\theta^{(0)}$. The algorithm is iterative. Given $\theta^{(k)}$, the next value $\theta^{(k+1)}$ is
computed in the following two steps:

(Expectation step) Compute $Q(\theta|\theta^{(k)})$ for all $\theta$, where

$$Q(\theta|\theta^{(k)}) = E[\, \log p_{cd}(X|\theta) \mid y, \theta^{(k)}\,]. \tag{5.5}$$

(Maximization step) Compute $\theta^{(k+1)} \in \arg\max_\theta Q(\theta|\theta^{(k)})$. In other words, find a value $\theta^{(k+1)}$ of
$\theta$ that maximizes $Q(\theta|\theta^{(k)})$ with respect to $\theta$.

Some intuition behind the algorithm is the following. If a vector of complete data x could
be observed, it would be reasonable to estimate $\theta$ by maximizing the pmf of the complete data,
$p_{cd}(x|\theta)$, with respect to $\theta$. This plan is not feasible if the complete data is not observed. The idea is
to estimate $\log p_{cd}(X|\theta)$ by its conditional expectation, $Q(\theta|\theta^{(k)})$, and then find $\theta$ to maximize this
conditional expectation. The conditional expectation is well defined if some value of the parameter
$\theta$ is fixed. For each iteration of the algorithm, the expectation step is completed using the latest
value of $\theta$, $\theta^{(k)}$, in computing the expectation of $\log p_{cd}(X|\theta)$.

In most applications there is some additional structure that helps in the computation of $Q(\theta|\theta^{(k)})$.
This typically happens when $p_{cd}$ factors into simple terms, such as in the case of hidden Markov
models discussed in this chapter, or when $p_{cd}$ has the form of an exponential raised to a low degree
polynomial, such as the Gaussian or exponential distribution. In some cases there are closed form
expressions for $Q(\theta|\theta^{(k)})$. In others, there may be an algorithm that generates samples of X with
the desired pmf $p_{cd}(x|\theta^{(k)})$ using random number generators, and then $\log p_{cd}(X|\theta)$ is used as an
approximation to $Q(\theta|\theta^{(k)})$.

Example 5.2.2 (Estimation of the variance of a signal) An observation Y is modeled as Y = S + N,
where the signal S is assumed to be a $N(0, \theta)$ random variable, where $\theta$ is an unknown parameter,
assumed to satisfy $\theta \geq 0$, and the noise N is a $N(0, \sigma^2)$ random variable where $\sigma^2$ is known and
strictly positive. Suppose it is desired to estimate $\theta$, the variance of the signal. Let y be a particular
observed value of Y. We consider two approaches to finding $\hat{\theta}_{ML}$: a direct approach, and the EM
algorithm.

For the direct approach, note that for $\theta$ fixed, Y is a $N(0, \theta + \sigma^2)$ random variable. Therefore,
the pdf of Y evaluated at y, or likelihood of y, is given by

$$f(y|\theta) = \frac{\exp\left(-\frac{y^2}{2(\theta + \sigma^2)}\right)}{\sqrt{2\pi(\theta + \sigma^2)}}.$$

The natural log of the likelihood is given by

$$\log f(y|\theta) = -\frac{\log(2\pi)}{2} - \frac{\log(\theta + \sigma^2)}{2} - \frac{y^2}{2(\theta + \sigma^2)}.$$

Maximizing over $\theta$ yields $\hat{\theta}_{ML} = (y^2 - \sigma^2)_+$. While this one-dimensional case is fairly simple, the
situation is different in higher dimensions, as explored in Problem 5.5. Thus, we examine use of
the EM algorithm for this example.

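To apply the EM algorithm for this example, take X = (S, N) as the complete data. The
observation is only the sum, Y = S + N, so the complete data is not observed. For given $\theta$, S and
N are independent, so the log of the joint pdf of the complete data is given as follows:

$$\log p_{cd}(s, n|\theta) = -\frac{\log(2\pi\theta)}{2} - \frac{s^2}{2\theta} - \frac{\log(2\pi\sigma^2)}{2} - \frac{n^2}{2\sigma^2}.$$

For the expectation step, we find

$$Q(\theta|\theta^{(k)}) = E[\, \log p_{cd}(S, N|\theta) \mid y, \theta^{(k)}\,]
= -\frac{\log(2\pi\theta)}{2} - \frac{E[S^2 \mid y, \theta^{(k)}]}{2\theta} - \frac{\log(2\pi\sigma^2)}{2} - \frac{E[N^2 \mid y, \theta^{(k)}]}{2\sigma^2}.$$

For the maximization step, we find

$$\frac{\partial Q(\theta|\theta^{(k)})}{\partial \theta} = -\frac{1}{2\theta} + \frac{E[S^2 \mid y, \theta^{(k)}]}{2\theta^2},$$

from which we see that $\theta^{(k+1)} = E[S^2 \mid y, \theta^{(k)}]$. Computation of $E[S^2 \mid y, \theta^{(k)}]$ is an exercise in
conditional Gaussian distributions, similar to Example 3.4.5. The conditional second moment is
the sum of the square of the conditional mean and the variance of the estimation error. Thus, the
EM algorithm becomes the following recursion:

$$\theta^{(k+1)} = \left(\frac{\theta^{(k)}}{\theta^{(k)} + \sigma^2}\right)^2 y^2 + \frac{\theta^{(k)} \sigma^2}{\theta^{(k)} + \sigma^2}. \tag{5.6}$$

Problem 5.3 shows that if $\theta^{(0)} > 0$, then $\theta^{(k)} \to \hat{\theta}_{ML}$ as $k \to \infty$.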
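The EM recursion for this example can be tried numerically. The sketch below assumes the update $\theta^{(k+1)} = E[S^2 \mid y, \theta^{(k)}]$, with the conditional mean $\frac{\theta}{\theta+\sigma^2} y$ and estimation error variance $\frac{\theta\sigma^2}{\theta+\sigma^2}$ for jointly Gaussian S and Y; the function names, the starting value, and the iteration count are illustrative choices.

```python
def em_variance_step(theta, y, sigma2):
    """One EM update: theta_next = E[S^2 | y, theta].

    The conditional second moment is the squared conditional mean of S
    plus the variance of the estimation error.
    """
    cond_mean = theta / (theta + sigma2) * y
    err_var = theta * sigma2 / (theta + sigma2)
    return cond_mean**2 + err_var


def em_variance(y, sigma2, theta0=1.0, iters=200):
    """Run the EM recursion from a strictly positive initial value theta0."""
    theta = theta0
    for _ in range(iters):
        theta = em_variance_step(theta, y, sigma2)
    return theta


def ml_variance(y, sigma2):
    """Direct ML estimate (y^2 - sigma2)_+ for comparison."""
    return max(y * y - sigma2, 0.0)


print(em_variance(3.0, 1.0), ml_variance(3.0, 1.0))
print(em_variance(0.5, 1.0), ml_variance(0.5, 1.0))
```

In the first call the iterates settle quickly near the direct ML estimate. In the second call, where $y^2 < \sigma^2$ and the ML estimate is zero, the iterates decrease toward zero but noticeably more slowly.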

Proposition 5.2.5 below shows that the likelihood $p(y|\theta^{(k)})$ is nondecreasing in k. In the ideal
case, the likelihood converges to the maximum possible value of the likelihood, and
$\lim_{k \to \infty} \theta^{(k)} = \hat{\theta}_{ML}(y)$. However, the sequence could converge to a local, but not global, maximizer of the likelihood,
or possibly even to an inflection point of the likelihood. This behavior is typical of gradient type
nonlinear optimization algorithms, to which the EM algorithm is similar. Note that even if the
parameter set is convex (as it is for the case of hidden Markov models), the corresponding sets
of probability distributions on Y are not convex. It is the geometry of the set of probability
distributions that really matters for the EM algorithm, rather than the geometry of the space of
the parameters. Before the proposition is stated, the divergence between two probability vectors
and some of its basic properties are discussed.

Definition 5.2.3 The divergence between probability vectors $p = (p_1, \ldots, p_n)$ and $q = (q_1, \ldots, q_n)$,
denoted by $D(p\|q)$, is defined by $D(p\|q) = \sum_i p_i \log(p_i/q_i)$, with the understanding that $p_i \log(p_i/q_i) = 0$
if $p_i = 0$ and $p_i \log(p_i/q_i) = +\infty$ if $p_i > q_i = 0$.

Lemma 5.2.4 (Basic properties of divergence)
(i) $D(p\|q) \geq 0$, with equality if and only if p = q
(ii) D is a convex function of the pair (p, q).



Proof. Property (i) follows from Lemma 5.1.7. Here is another proof. In proving (i), we can
assume that $q_i > 0$ for all i. The function

$$\phi(u) = \begin{cases} u \log u & u > 0 \\ 0 & u = 0 \end{cases}$$

is convex. Thus, by Jensen's inequality,

$$D(p\|q) = \sum_i \phi\left(\frac{p_i}{q_i}\right) q_i \geq \phi\left(\sum_i \frac{p_i}{q_i} \cdot q_i\right) = \phi(1) = 0,$$

so (i) is proved.

The proof of (ii) is based on the log-sum inequality, which is the fact that for nonnegative
numbers $a_1, \ldots, a_n, b_1, \ldots, b_n$:

$$\sum_i a_i \log\frac{a_i}{b_i} \geq a \log\frac{a}{b}, \tag{5.7}$$

where $a = \sum_i a_i$ and $b = \sum_i b_i$. To verify (5.7), note that it is true if and only if it is true with each
$a_i$ replaced by $c a_i$, for any strictly positive constant c. So it can be assumed that a = 1. Similarly,
it can be assumed that b = 1. For a = b = 1, (5.7) is equivalent to the fact $D(a\|b) \geq 0$, already
proved. So (5.7) is proved.

Let $0 < \alpha < 1$. Suppose $p^j = (p^j_1, \ldots, p^j_n)$ and $q^j = (q^j_1, \ldots, q^j_n)$ are probability distributions for
j = 1, 2, and let $p_i = \alpha p^1_i + (1 - \alpha) p^2_i$ and $q_i = \alpha q^1_i + (1 - \alpha) q^2_i$, for $1 \leq i \leq n$. That is, $(p^1, q^1)$ and
$(p^2, q^2)$ are two pairs of probability distributions, and $(p, q) = \alpha(p^1, q^1) + (1 - \alpha)(p^2, q^2)$. For i fixed
with $1 \leq i \leq n$, the log-sum inequality (5.7) with $(a_1, a_2, b_1, b_2) = (\alpha p^1_i, (1 - \alpha) p^2_i, \alpha q^1_i, (1 - \alpha) q^2_i)$
yields

$$\alpha p^1_i \log\frac{p^1_i}{q^1_i} + (1 - \alpha) p^2_i \log\frac{p^2_i}{q^2_i}
= \alpha p^1_i \log\frac{\alpha p^1_i}{\alpha q^1_i} + (1 - \alpha) p^2_i \log\frac{(1 - \alpha) p^2_i}{(1 - \alpha) q^2_i}
\geq p_i \log\frac{p_i}{q_i}.$$

Summing each side of this inequality over i yields $\alpha D(p^1\|q^1) + (1 - \alpha) D(p^2\|q^2) \geq D(p\|q)$, so that
$D(p\|q)$ is a convex function of the pair (p, q). ∎
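The definition of divergence, including its conventions for zero entries, and the two properties in Lemma 5.2.4 can be spot-checked numerically. The following is a minimal sketch; the function name and the particular probability vectors are illustrative assumptions, and a numerical check on a few vectors is of course not a proof.

```python
from math import inf, log


def divergence(p, q):
    """D(p||q) = sum_i p_i log(p_i / q_i) for probability vectors p and q.

    Conventions from the definition: a term is 0 if p_i = 0,
    and the divergence is +infinity if p_i > 0 while q_i = 0.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return inf
        total += pi * log(pi / qi)
    return total


# Property (i): nonnegativity, with value zero when p = q.
print(divergence([0.5, 0.5], [0.5, 0.5]), divergence([0.9, 0.1], [0.5, 0.5]))

# Property (ii): convexity in the pair (p, q), checked for one mixture.
p1, q1 = [0.9, 0.1], [0.5, 0.5]
p2, q2 = [0.2, 0.8], [0.6, 0.4]
a = 0.3
p = [a * u + (1 - a) * v for u, v in zip(p1, p2)]
q = [a * u + (1 - a) * v for u, v in zip(q1, q2)]
print(a * divergence(p1, q1) + (1 - a) * divergence(p2, q2) >= divergence(p, q))
```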

Proposition 5.2.5 (Convergence of the EM algorithm) Suppose that the complete data pmf can
be factored as $p_{cd}(x|\theta) = p(y|\theta)\, k(x|y, \theta)$ such that

(i) $\log p(y|\theta)$ is differentiable in $\theta$

(ii) $E[\, \log k(X|y, \bar{\theta}) \mid y, \bar{\theta}\,]$ is finite for all $\bar{\theta}$

(iii) $D(k(\cdot|y, \theta) \| k(\cdot|y, \theta'))$ is differentiable with respect to $\theta'$ for fixed $\theta$.

(iv) $D(k(\cdot|y, \theta) \| k(\cdot|y, \theta'))$ is continuous in $\theta$ for fixed $\theta'$.

and suppose that $p(y|\theta^{(0)}) > 0$. Then the likelihood $p(y|\theta^{(k)})$ is nondecreasing in k, and any limit
point $\theta^*$ of the sequence $(\theta^{(k)})$ is a stationary point of the objective function $p(y|\theta)$, which by
definition means

$$\left. \frac{\partial p(y|\theta)}{\partial \theta} \right|_{\theta = \theta^*} = 0. \tag{5.8}$$

Proof. Using the factorization $p_{cd}(x|\theta) = p(y|\theta)\, k(x|y, \theta)$,

$$\begin{aligned}
Q(\theta|\theta^{(k)}) &= E[\, \log p_{cd}(X|\theta) \mid y, \theta^{(k)}\,] \\
&= \log p(y|\theta) + E[\, \log k(X|y, \theta) \mid y, \theta^{(k)}\,] \\
&= \log p(y|\theta) + E\left[\, \log\frac{k(X|y, \theta)}{k(X|y, \theta^{(k)})} \,\Big|\, y, \theta^{(k)}\right] + R \\
&= \log p(y|\theta) - D(k(\cdot|y, \theta^{(k)}) \| k(\cdot|y, \theta)) + R,
\end{aligned} \tag{5.9}$$

where

$$R = E[\, \log k(X|y, \theta^{(k)}) \mid y, \theta^{(k)}\,].$$

By assumption (ii), R is finite, and it depends on y and $\theta^{(k)}$, but not on $\theta$. Therefore, the
maximization step of the EM algorithm is equivalent to:

$$\theta^{(k+1)} = \arg\max_\theta \left[ \log p(y|\theta) - D(k(\cdot|y, \theta^{(k)}) \| k(\cdot|y, \theta)) \right] \tag{5.10}$$

Thus, at each step, the EM algorithm attempts to maximize the log likelihood $\log p(y|\theta)$ itself,
minus a term which penalizes large differences between $\theta$ and $\theta^{(k)}$.

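The definition of $\theta^{(k+1)}$ implies that $Q(\theta^{(k+1)}|\theta^{(k)}) \geq Q(\theta^{(k)}|\theta^{(k)})$. Therefore, using (5.9) and
the fact $D(k(\cdot|y, \theta^{(k)}) \| k(\cdot|y, \theta^{(k)})) = 0$ yields

$$\log p(y|\theta^{(k+1)}) - D(k(\cdot|y, \theta^{(k)}) \| k(\cdot|y, \theta^{(k+1)})) \geq \log p(y|\theta^{(k)}). \tag{5.11}$$

In particular, since the divergence is nonnegative, $p(y|\theta^{(k)})$ is nondecreasing in k. Therefore,
$\lim_{k \to \infty} \log p(y|\theta^{(k)})$ exists.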

Suppose now that the sequence $(\theta^{(k)})$ has a limit point, $\theta^*$. By continuity, implied by the
differentiability assumption (i), $\lim_{k \to \infty} p(y|\theta^{(k)}) = p(y|\theta^*) < \infty$. For each k,

$$0 \leq \max_\theta \left[ \log p(y|\theta) - D(k(\cdot|y, \theta^{(k)}) \| k(\cdot|y, \theta)) \right] - \log p(y|\theta^{(k)}) \tag{5.12}$$

$$\phantom{0} \leq \log p(y|\theta^{(k+1)}) - \log p(y|\theta^{(k)}) \to 0 \text{ as } k \to \infty, \tag{5.13}$$

where (5.12) follows from the fact that $\theta^{(k)}$ is a possible value of $\theta$ in the maximization, and the
inequality in (5.13) follows from (5.10) and the fact that the divergence is always nonnegative.
Thus, the quantity on the right-hand side of (5.12) converges to zero as $k \to \infty$. So by continuity,
for any limit point $\theta^*$ of the sequence $(\theta^{(k)})$,

$$\max_\theta \left[ \log p(y|\theta) - D(k(\cdot|y, \theta^*) \| k(\cdot|y, \theta)) \right] - \log p(y|\theta^*) = 0$$

and therefore,

$$\theta^* \in \arg\max_\theta \left[ \log p(y|\theta) - D(k(\cdot|y, \theta^*) \| k(\cdot|y, \theta)) \right].$$

So the derivative of $\log p(y|\theta) - D(k(\cdot|y, \theta^*) \| k(\cdot|y, \theta))$ with respect to $\theta$ at $\theta = \theta^*$ is zero. The
same is true of the term $D(k(\cdot|y, \theta^*) \| k(\cdot|y, \theta))$ alone, because this term is nonnegative, it has
value 0 at $\theta = \theta^*$, and it is assumed to be differentiable in $\theta$. Therefore, the derivative of the first
term, $\log p(y|\theta)$, must be zero at $\theta^*$. ∎

Remark 5.2.6 In the above proposition and proof, we assume that $\theta^*$ is unconstrained. If there
are inequality constraints on $\theta$ and if some of them are tight for $\theta^*$, then we still find that if
$\theta^*$ is a limit point of $\theta^{(k)}$, then it is a maximizer of $f(\theta) = \log p(y|\theta) - D(k(\cdot|y, \theta^*) \| k(\cdot|y, \theta))$.
Thus, under regularity conditions implying the existence of Lagrange multipliers, the Kuhn-Tucker
optimality conditions are satisfied for the problem of maximizing $f(\theta)$. Since the derivatives of
$D(k(\cdot|y, \theta^*) \| k(\cdot|y, \theta))$ with respect to $\theta$ at $\theta = \theta^*$ are zero, and since the Kuhn-Tucker optimality
conditions only involve the first derivatives of the objective function, those conditions for the
problem of maximizing the true log likelihood function, $\log p(y|\theta)$, also hold at $\theta^*$.

5.3 Hidden Markov models

A popular model of one-dimensional sequences with dependencies, explored especially in the context
of speech processing, is the hidden Markov model. Suppose that

X = (Y, Z), where Z is unobserved data and Y is the observed data

$Z = (Z_1, \ldots, Z_T)$ is a time-homogeneous Markov process, with one-step transition probability
matrix $A = (a_{ij})$, and with $Z_1$ having the initial distribution $\pi$. Here, T, with $T \geq 1$, denotes
the total number of observation times. The state-space of Z is denoted by $\mathcal{S}$, and the number
of states of $\mathcal{S}$ is denoted by $N_s$.

$Y = (Y_1, \ldots, Y_T)$ is the observed data. It is such that given Z = z, for some $z = (z_1, \ldots, z_T)$, the
variables $Y_1, \ldots, Y_T$ are conditionally independent with $P(Y_t = l \mid Z = z) = b_{z_t, l}$, for a given
observation generation matrix $B = (b_{il})$. The observations are assumed to take values in a
set of size $N_o$, so that B is an $N_s \times N_o$ matrix and each row of B is a probability vector.

The parameter for this model is $\theta = (\pi, A, B)$. The model is illustrated in Figure 5.1.

[Figure 5.1: Structure of hidden Markov model. Nodes: hidden states $Z_1, Z_2, Z_3, \ldots, Z_T$ and observations $Y_1, Y_2, Y_3, \ldots, Y_T$.]

The pmf of the complete data, for a given choice of $\theta$, is

$$p_{cd}(y, z|\theta) = \pi_{z_1} \prod_{t=1}^{T-1} a_{z_t, z_{t+1}} \prod_{t=1}^{T} b_{z_t, y_t}. \tag{5.14}$$

The correspondence between the pmf and the graph shown in Figure 5.1 is that each term on the 
right-hand side of (5.14) corresponds to an edge in the graph. 

In what follows we consider the following three estimation tasks associated with this model: 

1. Given the observed data and $\theta$, compute the conditional distribution of the state (solved by
the forward-backward algorithm)

2. Given the observed data and $\theta$, compute the most likely sequence for hidden states (solved
by the Viterbi algorithm)

3. Given the observed data, compute the maximum likelihood (ML) estimate of $\theta$ (solved by the
Baum-Welch/EM algorithm).

These problems are addressed in the next three subsections. As we will see, the first of these
problems arises in solving the third problem. The second problem has some similarities to the first
problem, but it can be addressed separately.

5.3.1 Posterior state probabilities and the forward-backward algorithm 

In this subsection we assume that the parameter $\theta = (\pi, A, B)$ of the hidden Markov model is
known and fixed. We shall describe computationally efficient methods for computing posterior
probabilities for the state at a given time t, or for a transition at a given pair of times t to t + 1,
of the hidden Markov process, based on past observations (case of causal filtering) or based on
past and future observations (case of smoothing). These posterior probabilities would allow us to
compute, for example, MAP estimates of the state or transition of the Markov process at a given
time. For example, we have:

$$\hat{Z}_{t|t\, MAP} = \arg\max_{i \in \mathcal{S}} P(Z_t = i \mid Y_1 = y_1, \ldots, Y_t = y_t, \theta) \tag{5.15}$$

$$\hat{Z}_{t|T\, MAP} = \arg\max_{i \in \mathcal{S}} P(Z_t = i \mid Y_1 = y_1, \ldots, Y_T = y_T, \theta) \tag{5.16}$$

$$\widehat{(Z_t, Z_{t+1})}_{|T\, MAP} = \arg\max_{(i,j) \in \mathcal{S} \times \mathcal{S}} P(Z_t = i, Z_{t+1} = j \mid Y_1 = y_1, \ldots, Y_T = y_T, \theta), \tag{5.17}$$

where the convention for subscripts is similar to that used for Kalman filtering: "t|T" denotes
that the state is to be estimated at time t based on the observations up to time T. The key to
efficient computation is to recursively compute certain quantities through a recursion forward in
time, and others through a recursion backward in time. We begin by deriving a forward recursion
for the variables $\alpha_i(t)$ defined as follows:

$$\alpha_i(t) = P(Y_1 = y_1, \ldots, Y_t = y_t, Z_t = i \mid \theta),$$

for $i \in \mathcal{S}$ and $1 \leq t \leq T$. The initial value is $\alpha_i(1) = \pi_i b_{i y_1}$. By the law of total probability, the
update rule is:

$$\begin{aligned}
\alpha_j(t+1) &= \sum_{i \in \mathcal{S}} P(Y_1 = y_1, \ldots, Y_{t+1} = y_{t+1}, Z_t = i, Z_{t+1} = j \mid \theta) \\
&= \sum_{i \in \mathcal{S}} P(Y_1 = y_1, \ldots, Y_t = y_t, Z_t = i \mid \theta) \\
&\qquad \cdot P(Z_{t+1} = j, Y_{t+1} = y_{t+1} \mid Y_1 = y_1, \ldots, Y_t = y_t, Z_t = i, \theta) \\
&= \sum_{i \in \mathcal{S}} \alpha_i(t)\, a_{ij}\, b_{j y_{t+1}}.
\end{aligned}$$

The right-hand side of (5.15) can be expressed in terms of the $\alpha$'s as follows.

$$P(Z_t = i \mid Y_1 = y_1, \ldots, Y_t = y_t, \theta)
= \frac{P(Z_t = i, Y_1 = y_1, \ldots, Y_t = y_t \mid \theta)}{P(Y_1 = y_1, \ldots, Y_t = y_t \mid \theta)}
= \frac{\alpha_i(t)}{\sum_{j \in \mathcal{S}} \alpha_j(t)} \tag{5.18}$$

The computation of the $\alpha$'s and the use of (5.18) is an alternative to, and very similar to, the Kalman
filtering equations. The difference is that for the Kalman filtering equations, the distributions involved
are all Gaussian, so it suffices to compute means and variances, and also the normalization in (5.18),
which is done once after the $\alpha$'s are computed, is more or less done at each step in the Kalman
filtering equations.

To express the posterior probabilities involving both past and future observations used in (5.16),
the following $\beta$ variables are introduced:

$$\beta_i(t) = P(Y_{t+1} = y_{t+1}, \ldots, Y_T = y_T \mid Z_t = i, \theta),$$

for $i \in \mathcal{S}$ and $1 \leq t \leq T$. The definition is not quite the time reversal of the definition of the $\alpha$'s,
because the event $Z_t = i$ is being conditioned upon in the definition of $\beta_i(t)$. This asymmetry is
introduced because the presentation of the model itself is not symmetric in time. The backward
equation for the $\beta$'s is as follows. The initial condition for the backward equations is $\beta_i(T) = 1$ for
all i. By the law of total probability, the update rule is

$$\begin{aligned}
\beta_i(t-1) &= \sum_{j \in \mathcal{S}} P(Y_t = y_t, \ldots, Y_T = y_T, Z_t = j \mid Z_{t-1} = i, \theta) \\
&= \sum_{j \in \mathcal{S}} P(Y_t = y_t, Z_t = j \mid Z_{t-1} = i, \theta) \\
&\qquad \cdot P(Y_{t+1} = y_{t+1}, \ldots, Y_T = y_T \mid Z_t = j, Y_t = y_t, Z_{t-1} = i, \theta) \\
&= \sum_{j \in \mathcal{S}} a_{ij}\, b_{j y_t}\, \beta_j(t).
\end{aligned}$$

Note that

$$\begin{aligned}
P(Z_t = i, Y_1 = y_1, \ldots, Y_T = y_T \mid \theta) &= P(Z_t = i, Y_1 = y_1, \ldots, Y_t = y_t \mid \theta) \\
&\qquad \cdot P(Y_{t+1} = y_{t+1}, \ldots, Y_T = y_T \mid \theta, Z_t = i, Y_1 = y_1, \ldots, Y_t = y_t) \\
&= P(Z_t = i, Y_1 = y_1, \ldots, Y_t = y_t \mid \theta) \\
&\qquad \cdot P(Y_{t+1} = y_{t+1}, \ldots, Y_T = y_T \mid \theta, Z_t = i) \\
&= \alpha_i(t)\, \beta_i(t)
\end{aligned}$$

from which we derive the smoothing equation for the conditional distribution of the state at a time
t, given all the observations:

$$\begin{aligned}
\gamma_i(t) &\triangleq P(Z_t = i \mid Y_1 = y_1, \ldots, Y_T = y_T, \theta) \\
&= \frac{P(Z_t = i, Y_1 = y_1, \ldots, Y_T = y_T \mid \theta)}{P(Y_1 = y_1, \ldots, Y_T = y_T \mid \theta)} \\
&= \frac{\alpha_i(t)\, \beta_i(t)}{\sum_{j \in \mathcal{S}} \alpha_j(t)\, \beta_j(t)}.
\end{aligned}$$

The variable $\gamma_i(t)$ defined here is the same as the probability in the right-hand side of (5.16), so
that we have an efficient way to find the MAP smoothing estimator defined in (5.16). For later
use, we note from the above that for any i such that $\gamma_i(t) > 0$,

$$P(Y_1 = y_1, \ldots, Y_T = y_T \mid \theta) = \frac{\alpha_i(t)\, \beta_i(t)}{\gamma_i(t)}. \tag{5.19}$$

Similarly,

$$\begin{aligned}
P(Z_t = i, Z_{t+1} = j, &\, Y_1 = y_1, \ldots, Y_T = y_T \mid \theta) \\
&= P(Z_t = i, Y_1 = y_1, \ldots, Y_t = y_t \mid \theta) \\
&\qquad \cdot P(Z_{t+1} = j, Y_{t+1} = y_{t+1} \mid \theta, Z_t = i, Y_1 = y_1, \ldots, Y_t = y_t) \\
&\qquad \cdot P(Y_{t+2} = y_{t+2}, \ldots, Y_T = y_T \mid \theta, Z_t = i, Z_{t+1} = j, Y_1 = y_1, \ldots, Y_{t+1} = y_{t+1}) \\
&= \alpha_i(t)\, a_{ij}\, b_{j y_{t+1}}\, \beta_j(t+1),
\end{aligned}$$

from which we derive the smoothing equation for the conditional distribution of a state-transition
for some pair of consecutive times t and t + 1, given all the observations:

$$\begin{aligned}
\xi_{ij}(t) &= P(Z_t = i, Z_{t+1} = j \mid Y_1 = y_1, \ldots, Y_T = y_T, \theta) \\
&= \frac{P(Z_t = i, Z_{t+1} = j, Y_1 = y_1, \ldots, Y_T = y_T \mid \theta)}{P(Y_1 = y_1, \ldots, Y_T = y_T \mid \theta)} \\
&= \frac{\alpha_i(t)\, a_{ij}\, b_{j y_{t+1}}\, \beta_j(t+1)}{\sum_{i', j'} \alpha_{i'}(t)\, a_{i'j'}\, b_{j' y_{t+1}}\, \beta_{j'}(t+1)} \\
&= \frac{\gamma_i(t)\, a_{ij}\, b_{j y_{t+1}}\, \beta_j(t+1)}{\beta_i(t)},
\end{aligned}$$

where the final expression is derived using (5.19). The variable $\xi_{ij}(t)$ defined here is the same as
the probability in the right-hand side of (5.17), so that we have an efficient way to find the MAP
smoothing estimator of a state transition, defined in (5.17).

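Summarizing, the forward-backward or $\alpha$-$\beta$ algorithm for computing the posterior distribution
of the state or a transition is given by:

Algorithm 5.3.1 (The forward-backward algorithm) The $\alpha$'s can be recursively computed forward
in time, and the $\beta$'s recursively computed backward in time, using:

$$\alpha_j(t+1) = \sum_{i \in \mathcal{S}} \alpha_i(t)\, a_{ij}\, b_{j y_{t+1}}, \quad \text{with initial condition } \alpha_i(1) = \pi_i b_{i y_1}$$

$$\beta_i(t-1) = \sum_{j \in \mathcal{S}} a_{ij}\, b_{j y_t}\, \beta_j(t), \quad \text{with initial condition } \beta_i(T) = 1.$$

Then the posterior probabilities can be found:

$$P(Z_t = i \mid Y_1 = y_1, \ldots, Y_t = y_t, \theta) = \frac{\alpha_i(t)}{\sum_{j \in \mathcal{S}} \alpha_j(t)} \tag{5.20}$$

$$\gamma_i(t) \triangleq P(Z_t = i \mid Y_1 = y_1, \ldots, Y_T = y_T, \theta) = \frac{\alpha_i(t)\, \beta_i(t)}{\sum_{j \in \mathcal{S}} \alpha_j(t)\, \beta_j(t)} \tag{5.21}$$

$$\xi_{ij}(t) \triangleq P(Z_t = i, Z_{t+1} = j \mid Y_1 = y_1, \ldots, Y_T = y_T, \theta) = \frac{\alpha_i(t)\, a_{ij}\, b_{j y_{t+1}}\, \beta_j(t+1)}{\sum_{i', j'} \alpha_{i'}(t)\, a_{i'j'}\, b_{j' y_{t+1}}\, \beta_{j'}(t+1)} \tag{5.22}$$

$$\phantom{\xi_{ij}(t)} = \frac{\gamma_i(t)\, a_{ij}\, b_{j y_{t+1}}\, \beta_j(t+1)}{\beta_i(t)}. \tag{5.23}$$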



Remark 5.3.2 If the number of observations runs into the hundreds or thousands, the $\alpha$'s and $\beta$'s
can become so small that underflow problems can be encountered in numerical computation.
However, the formulas (5.20), (5.21), and (5.22) for the posterior probabilities in the forward-backward
algorithm are still valid if the $\alpha$'s and $\beta$'s are multiplied by time dependent (but state independent)
constants (for this purpose, (5.22) is more convenient than (5.23), because (5.23) involves $\beta$'s at
two different times). Then, the $\alpha$'s and $\beta$'s can be renormalized after each time step of computation
to have sum equal to one. Moreover, the sum of the logarithms of the normalization factors for the
$\alpha$'s can be stored in order to recover the log of the likelihood, $\log p(y|\theta) = \log \sum_{i \in \mathcal{S}} \alpha_i(T)$.
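The forward-backward recursions with per-step renormalization can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: states, symbols, and times are 0-based, the function name is hypothetical, and the backward variables are divided by the same factors used to normalize the forward variables (one valid choice of time-dependent, state-independent constants, under which the scaled $\beta$'s need not sum to one).

```python
from math import log


def forward_backward(pi, A, B, y):
    """Scaled forward-backward pass for a discrete HMM with parameter (pi, A, B).

    Returns (filtered, gamma, xi, loglik), where filtered[t][i] is the causal
    posterior P(Z_t = i | y_0..y_t), gamma[t][i] is the smoothed posterior,
    xi[t][i][j] is the smoothed transition posterior, and loglik = log p(y | theta).
    """
    N, T = len(pi), len(y)
    alpha = [[0.0] * N for _ in range(T)]   # scaled alphas; each row sums to one
    beta = [[1.0] * N for _ in range(T)]    # scaled betas; last row is all ones
    scale = [0.0] * T                       # normalization factor for each time step

    for t in range(T):                      # forward recursion
        for j in range(N):
            if t == 0:
                prior = pi[j]
            else:
                prior = sum(alpha[t - 1][i] * A[i][j] for i in range(N))
            alpha[t][j] = prior * B[j][y[t]]
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]

    for t in range(T - 2, -1, -1):          # backward recursion, same scale factors
        for i in range(N):
            total = sum(A[i][j] * B[j][y[t + 1]] * beta[t + 1][j] for j in range(N))
            beta[t][i] = total / scale[t + 1]

    gamma = [[alpha[t][i] * beta[t][i] for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][y[t + 1]] * beta[t + 1][j] / scale[t + 1]
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    loglik = sum(log(c) for c in scale)     # log-likelihood from the stored factors
    return alpha, gamma, xi, loglik


# Hypothetical two-state, two-symbol model and observation sequence.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
filtered, gamma, xi, loglik = forward_backward(pi, A, B, [0, 1, 0])
print(gamma, loglik)
```

With this scaling the product of the scaled forward and backward variables is already the smoothed posterior, so no further normalization is needed for `gamma` or `xi`.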

5.3.2 Most likely state sequence — Viterbi algorithm 

Suppose the parameter $\theta = (\pi, A, B)$ is known, and that $Y = (Y_1, \ldots, Y_T)$ is observed. In some
applications one wishes to have an estimate of the entire sequence Z. Since $\theta$ is known, Y and Z
can be viewed as random vectors with a known joint pmf, namely $p_{cd}(y, z|\theta)$. For the remainder
of this section, let y denote a fixed observed sequence, $y = (y_1, \ldots, y_T)$. We will seek the MAP
estimate, $\hat{Z}_{MAP}(y, \theta)$, of the entire state sequence $Z = (Z_1, \ldots, Z_T)$, given Y = y. By definition, it
is the z that maximizes the posterior pmf $p(z|y, \theta)$, and as shown in Section 5.1, it is
also equal to the maximizer of the joint pmf of Y and Z:

$$\hat{Z}_{MAP}(y, \theta) = \arg\max_z\, p_{cd}(y, z|\theta)$$

The Viterbi algorithm (a special case of dynamic programming), described next, is a computationally
efficient algorithm for simultaneously finding the maximizing sequence $z^* \in \mathcal{S}^T$ and computing
$p_{cd}(y, z^*|\theta)$. It uses the variables:

$$\delta_i(t) = \max_{(z_1, \ldots, z_{t-1}) \in \mathcal{S}^{t-1}} P(Z_1 = z_1, \ldots, Z_{t-1} = z_{t-1}, Z_t = i, Y_1 = y_1, \ldots, Y_t = y_t \mid \theta).$$

The $\delta$'s can be computed by a recursion forward in time, using the initial values $\delta_i(1) = \pi(i)\, b_{i y_1}$
and the recursion derived as follows:

$$\begin{aligned}
\delta_j(t) &= \max_i \max_{\{z_1, \ldots, z_{t-2}\}} P(Z_1 = z_1, \ldots, Z_{t-2} = z_{t-2}, Z_{t-1} = i, Z_t = j, Y_1 = y_1, \ldots, Y_t = y_t \mid \theta) \\
&= \max_i \max_{\{z_1, \ldots, z_{t-2}\}} P(Z_1 = z_1, \ldots, Z_{t-2} = z_{t-2}, Z_{t-1} = i, Y_1 = y_1, \ldots, Y_{t-1} = y_{t-1} \mid \theta)\, a_{ij}\, b_{j y_t} \\
&= \max_i \{\delta_i(t-1)\, a_{ij}\, b_{j y_t}\}.
\end{aligned}$$

Note that $\delta_i(T) = \max_{z : z_T = i} p_{cd}(y, z|\theta)$. Thus, the following algorithm correctly finds $\hat{Z}_{MAP}(y, \theta)$.

Algorithm 5.3.3 (Viterbi algorithm) Compute the $\delta$'s and associated back pointers by a recursion
forward in time:

(initial condition) $\delta_i(1) = \pi(i)\, b_{i y_1}$

(recursive step) $\delta_j(t) = \max_i \{\delta_i(t-1)\, a_{ij}\, b_{j y_t}\}$ (5.24)

(storage of back pointers) $\phi_j(t) = \arg\max_i \{\delta_i(t-1)\, a_{ij}\, b_{j y_t}\}$

Then $z^* = \hat{Z}_{MAP}(y, \theta)$ satisfies $p_{cd}(y, z^*|\theta) = \max_i \delta_i(T)$, and $z^*$ is given by tracing backward in
time:

$$z^*_T = \arg\max_i \delta_i(T) \quad \text{and} \quad z^*_{t-1} = \phi_{z^*_t}(t) \text{ for } 2 \leq t \leq T. \tag{5.25}$$
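The forward recursion with back pointers and the backward trace can be sketched in a few lines. The code below is an illustrative sketch, assuming 0-based states, symbols, and times, and a hypothetical function name; it works with probabilities directly, so for long observation sequences one would likely switch to logarithms to avoid underflow.

```python
def viterbi(pi, A, B, y):
    """Return (path, prob): a most likely state sequence and its joint probability p_cd(y, z*)."""
    N, T = len(pi), len(y)
    delta = [pi[i] * B[i][y[0]] for i in range(N)]   # initial condition
    back = []                                        # back[t-1][j]: best predecessor of j at time t

    for t in range(1, T):                            # recursive step with back pointers
        new_delta, pointers = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            pointers.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][y[t]])
        delta = new_delta
        back.append(pointers)

    last = max(range(N), key=lambda i: delta[i])     # best final state
    path = [last]
    for pointers in reversed(back):                  # trace the back pointers
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]


# Hypothetical two-state, two-symbol model and observation sequence.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(viterbi(pi, A, B, [0, 1, 0]))
```

Since the factor $b_{j y_t}$ does not depend on the predecessor i, the back pointer can be selected by maximizing $\delta_i(t-1) a_{ij}$ alone, which is what the inner `max` does.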

5.3.3 The Baum-Welch algorithm, or EM algorithm for HMM

The EM algorithm, introduced in Section 5.2, can be usefully applied to many parameter estimation 
problems with hidden data. This section shows how to apply it to the problem of estimating the 
parameter of a hidden Markov model from an observed output sequence. This results in the Baum- 
Welch algorithm, which was developed earlier than the EM algorithm, in the particular context of 
HMMs. 

The parameter to be estimated is $\theta = (\pi, A, B)$. The complete data consists of $(Y, Z)$ whereas
the observed, incomplete data consists of $Y$ alone. The initial parameter $\theta^{(0)} = (\pi^{(0)}, A^{(0)}, B^{(0)})$
should have all entries strictly positive, because any entry that is zero will remain zero at the end
of an iteration. Suppose $\theta^{(k)}$ is given. The first half of an iteration of the EM algorithm is to
compute, or determine in closed form, $Q(\theta|\theta^{(k)})$. Taking logarithms in the expression (5.14) for the
pmf of the complete data yields

\[ \log p_{cd}(y, z|\theta) = \log \pi_{z_1} + \sum_{t=1}^{T-1} \log a_{z_t, z_{t+1}} + \sum_{t=1}^{T} \log b_{z_t, y_t}. \]

Taking the expectation yields

\begin{align*}
Q(\theta|\theta^{(k)}) &= E[\log p_{cd}(y, Z|\theta) \mid y, \theta^{(k)}] \\
&= \sum_{i \in S} \gamma_i(1) \log \pi_i + \sum_{t=1}^{T-1} \sum_{i,j} \xi_{ij}(t) \log a_{ij} + \sum_{t=1}^{T} \sum_{i \in S} \gamma_i(t) \log b_{i,y_t},
\end{align*}
where the variables $\gamma_i(t)$ and $\xi_{ij}(t)$ are defined using the model with parameter $\theta^{(k)}$. In view of this
closed form expression for $Q(\theta|\theta^{(k)})$, the expectation step of the EM algorithm essentially comes
down to computing the $\gamma$'s and the $\xi$'s. This computation can be done using the forward-backward
algorithm, Algorithm 5.3.1, with $\theta = \theta^{(k)}$.

The second half of an iteration of the EM algorithm is to find the value of $\theta$ that maximizes
$Q(\theta|\theta^{(k)})$, and set $\theta^{(k+1)}$ equal to that value. The parameter $\theta = (\pi, A, B)$ for this problem can be
viewed as a set of probability vectors. Namely, $\pi$ is a probability vector, and, for each $i$ fixed, $a_{ij}$
as $j$ varies, and $b_{il}$ as $l$ varies, are probability vectors. Therefore, Example 5.1.8 and Lemma 5.1.7
will be of use. Motivated by these, we rewrite the expression found for $Q(\theta|\theta^{(k)})$ to get

\begin{align*}
Q(\theta|\theta^{(k)}) &= \sum_{i \in S} \gamma_i(1) \log \pi_i + \sum_{i,j} \sum_{t=1}^{T-1} \xi_{ij}(t) \log a_{ij} + \sum_{i \in S} \sum_{t=1}^{T} \gamma_i(t) \log b_{i,y_t} \\
&= \sum_{i \in S} \gamma_i(1) \log \pi_i + \sum_{i,j} \left( \sum_{t=1}^{T-1} \xi_{ij}(t) \right) \log a_{ij} \\
&\quad + \sum_{i \in S} \sum_{l} \left( \sum_{t=1}^{T} \gamma_i(t) I_{\{y_t = l\}} \right) \log b_{il} \tag{5.26}
\end{align*}

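The first summation in (5.26) has the same form as the sum in Lemma 5.1.7. Similarly, for each $i$
fixed, the sum over $j$ involving $a_{ij}$, and the sum over $l$ involving $b_{il}$, also have the same form as
the sum in Lemma 5.1.7. Therefore, the maximization step of the EM algorithm can be written in
the following form:

\begin{align*}
\pi_i^{(k+1)} &= \gamma_i(1) \tag{5.27} \\
a_{ij}^{(k+1)} &= \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)} \tag{5.28} \\
b_{il}^{(k+1)} &= \frac{\sum_{t=1}^{T} \gamma_i(t) I_{\{y_t = l\}}}{\sum_{t=1}^{T} \gamma_i(t)} \tag{5.29}
\end{align*}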

The update equations (5.27)-(5.29) have a natural interpretation. Equation (5.27) means that the
new value of the distribution of the initial state, $\pi^{(k+1)}$, is simply the posterior distribution of the
initial state, computed assuming $\theta^{(k)}$ is the true parameter value. The other two update equations
are similar, but are more complicated because the transition matrix A and observation generation 
matrix B do not change with time. The denominator of (5.28) is the posterior expected number of 
times the state is equal to i up to time T — 1, and the numerator is the posterior expected number 
of times two consecutive states are i,j. Thus, if we think of the time of a jump as being random, 
the right-hand side of (5.28) is the time-averaged posterior conditional probability that, given the 
state at the beginning of a transition is i at a typical time, the next state will be j. Similarly, the 
right-hand side of (5.29) is the time-averaged posterior conditional probability that, given the state 
is i at a typical time, the observation will be I. 
Algorithm 5.3.4 (Baum-Welch algorithm, or EM algorithm for HMM) Select the state space $S$,
and in particular, the cardinality, $N_s$, of the state space, and let $\theta^{(0)}$ denote a given initial choice
of parameter. Given $\theta^{(k)}$, compute $\theta^{(k+1)}$ by using the forward-backward algorithm (Algorithm
5.3.1) with $\theta = \theta^{(k)}$ to compute the $\gamma$'s and $\xi$'s. Then use (5.27)-(5.29) to compute
$\theta^{(k+1)} = (\pi^{(k+1)}, A^{(k+1)}, B^{(k+1)})$.
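One iteration of the algorithm can be sketched as follows. The sketch assumes a plain, unnormalized forward-backward pass (which may differ in detail from Algorithm 5.3.1 and underflows for long sequences), at least two observations, and a strictly positive current parameter. The name `baum_welch_step` and the list-based data layout are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def baum_welch_step(pi: Sequence[float], A: Matrix, B: Matrix,
                    y: Sequence[int]) -> Tuple[List[float], Matrix, Matrix, float]:
    """One EM iteration; returns (pi_new, A_new, B_new, likelihood of y under the old parameter)."""
    n, m, T = len(pi), len(B[0]), len(y)
    # Forward pass: alpha[t][i] = P(Y_1..Y_t = y_1..y_t, Z_t = i)
    alpha = [[pi[i] * B[i][y[0]] for i in range(n)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][y[t]]
                      for j in range(n)])
    # Backward pass: beta[t][i] = P(Y_{t+1}..Y_T = y_{t+1}..y_T | Z_t = i)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][y[t + 1]] * beta[t + 1][j] for j in range(n))
                   for i in range(n)]
    like = sum(alpha[T - 1])
    # Posterior quantities gamma_i(t) and xi_ij(t) under the current parameter
    gamma = [[alpha[t][i] * beta[t][i] / like for i in range(n)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][y[t + 1]] * beta[t + 1][j] / like
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    # M-step: the updates (5.27), (5.28), (5.29)
    pi_new = gamma[0][:]
    A_new = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1)) for j in range(n)]
             for i in range(n)]
    B_new = [[sum(gamma[t][i] for t in range(T) if y[t] == l) /
              sum(gamma[t][i] for t in range(T)) for l in range(m)]
             for i in range(n)]
    return pi_new, A_new, B_new, like
```

Calling the function repeatedly, feeding each output parameter back in, gives the full iteration; the returned likelihood values should be nondecreasing from one call to the next.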

5.4 Notes 

The EM algorithm is due to A.P. Dempster, N.M. Laird, and D.B. Rubin [3]. The paper includes
examples and a proof that the likelihood is increased with each iteration of the algorithm. An 
article on the convergence of the EM algorithm is given in [16]. Earlier related work includes that 
of Baum et al. [2], giving the Baum-Welch algorithm. A tutorial on inference for HMMs and 
applications to speech recognition is given in [11]. 

5.5 Problems 

5.1 Estimation of a Poisson parameter 

Suppose $Y$ is assumed to be a $Poi(\theta)$ random variable. Using the Bayesian method, suppose the
prior distribution of $\theta$ is the exponential distribution with some known parameter $\lambda > 0$.
(a) Find $\widehat{\theta}_{MAP}(k)$, the MAP estimate of $\theta$ given that $Y = k$ is observed, for some $k \ge 0$.
(b) For what values of $\lambda$ is $\widehat{\theta}_{MAP}(k) \approx \widehat{\theta}_{ML}(k)$? (The ML estimator was found in Example 5.1.3.)
Why should that be expected?

5.2 A variance estimation problem with Poisson observation 

The input voltage to an optical device is $X$ and the number of photons observed at a detector is
$N$. Suppose $X$ is a Gaussian random variable with mean zero and variance $\sigma^2$, and that given
$X$, the random variable $N$ has the Poisson distribution with mean $X^2$. (Recall that the Poisson
distribution with mean $\lambda$ has probability mass function $\lambda^n e^{-\lambda}/n!$ for $n \ge 0$.)

(a) Express $P\{N = n\}$ in terms of $\sigma^2$. You can express this as an integral. You do not have
to perform the integration.

(b) Find the maximum likelihood estimator of $\sigma^2$ given $N$. (Caution: Estimate $\sigma^2$, not $X$. Be as
explicit as possible; the final answer has a simple form. Hint: You can first simplify your answer to
part (a) by using the fact that if $X$ is a $N(0, \sigma^2)$ random variable, then $E[X^{2n}] = \frac{\sigma^{2n}(2n)!}{n!\,2^n}$.)

5.3 Convergence of the EM algorithm for an example 

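The purpose of this exercise is to verify for Example 5.2.2 that if $\theta^{(0)} > 0$, then $\theta^{(k)} \to \widehat{\theta}_{ML}$
as $k \to \infty$. As shown in the example, $\widehat{\theta}_{ML} = (y^2 - \sigma^2)_+$. Let
$F(\theta) = \left(\frac{\theta}{\theta + \sigma^2}\right)^2 y^2 + \frac{\theta \sigma^2}{\theta + \sigma^2}$, so
that the recursion (5.6) has the form $\theta^{(k+1)} = F(\theta^{(k)})$. Clearly, over $\mathbb{R}_+$, $F$ is increasing and
bounded.
(a) Show that $0$ is the only nonnegative solution of $F(\theta) = \theta$ if $y^2 \le \sigma^2$ and that $0$ and
$y^2 - \sigma^2$ are the only nonnegative solutions of $F(\theta) = \theta$ if $y^2 > \sigma^2$.
(b) Show that for small $\theta > 0$, $F(\theta) = \theta + \frac{\theta^2 (y^2 - \sigma^2)}{\sigma^4} + o(\theta^2)$.
(Hint: For $0 < \theta < \sigma^2$, $\frac{\theta}{\theta + \sigma^2} = \frac{\theta}{\sigma^2} \cdot \frac{1}{1 + \theta/\sigma^2} = \frac{\theta}{\sigma^2}\left(1 - \frac{\theta}{\sigma^2} + \left(\frac{\theta}{\sigma^2}\right)^2 - \ldots\right)$.)
(c) Sketch $F$ and argue, using the above properties of $F$, that if $\theta^{(0)} > 0$, then $\theta^{(k)} \to \widehat{\theta}_{ML}$.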

5.4 Transformation of estimators and estimators of transformations 

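Consider estimating a parameter $\theta \in [0, 1]$ from an observation $Y$. A prior density of $\theta$ is available
for the Bayes estimators, MAP and MMSE, and the conditional density of $Y$ given $\theta$ is known.
Answer the following questions and briefly explain your answers.

(a) Does $3 + 5\widehat{\theta}_{ML} = \widehat{(3 + 5\theta)}_{ML}$?

(b) Does $(\widehat{\theta}_{ML})^3 = \widehat{(\theta^3)}_{ML}$?

(c) Does $3 + 5\widehat{\theta}_{MAP} = \widehat{(3 + 5\theta)}_{MAP}$?

(d) Does $(\widehat{\theta}_{MAP})^3 = \widehat{(\theta^3)}_{MAP}$?

(e) Does $3 + 5\widehat{\theta}_{MMSE} = \widehat{(3 + 5\theta)}_{MMSE}$?

(f) Does $(\widehat{\theta}_{MMSE})^3 = \widehat{(\theta^3)}_{MMSE}$?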

5.5 Using the EM algorithm for estimation of a signal variance 

This problem generalizes Example 5.2.2 to vector observations. Suppose the observation is
$Y = S + N$, such that the signal $S$ and noise $N$ are independent random vectors in $\mathbb{R}^d$. Assume that
$S$ is $N(0, \theta I)$, and $N$ is $N(0, \Sigma_N)$, where $\theta$, with $\theta > 0$, is the parameter to be estimated, $I$ is the
identity matrix, and $\Sigma_N$ is known.

(a) Suppose $\theta$ is known. Find the MMSE estimate of $S$, $\widehat{S}_{MMSE}$, and find an expression for the
covariance matrix of the error vector, $S - \widehat{S}_{MMSE}$.

(b) Suppose now that $\theta$ is unknown. Describe a direct approach to computing $\widehat{\theta}_{ML}(Y)$.

(c) Describe how $\widehat{\theta}_{ML}(Y)$ can be computed using the EM algorithm.

(d) Consider how your answers to parts (b) and (c) simplify in case $d = 2$ and the covariance matrix
of the noise, $\Sigma_N$, is the identity matrix.

5.6 Finding a most likely path 

Consider an HMM with state space $S = \{0, 1\}$, observation space $\{0, 1, 2\}$, and parameter
$\theta = (\pi, A, B)$ given by:

[Display of $\pi$, $A$, and $B$ not fully recoverable; surviving entries: $a,\ a^3$ and $ca,\ ca^2,\ ca^3$.]

Here $a$ and $c$ are positive constants. Their actual numerical values aren't important, other than
the fact that $a < 1$. Find the MAP state sequence for the observation sequence 021201, using the
Viterbi algorithm. Show your work.

5.7 State estimation for an HMM with conditionally Gaussian observations 

Consider a discrete-time Markov process $Z = (Z_1, Z_2, Z_3, Z_4)$ with state-space $\{0, 1, 2\}$, initial
distribution (i.e. distribution of $Z_1$) $\pi = (c2^{-3}, c, c2^{-5})$ (where $c > 0$ and its numerical value is not
relevant), and transition rate diagram shown.

(a) Place weights on the edges of the trellis below so that the minimum sum of weights along a
path in the trellis corresponds to the most likely state sequence of length four. That is, you are to
use the Viterbi algorithm approach to find $z^* = (z_1^*, z_2^*, z_3^*, z_4^*)$ that maximizes
$P\{(Z_1, Z_2, Z_3, Z_4) = (z_1, z_2, z_3, z_4)\}$ over all choices of $(z_1, z_2, z_3, z_4)$. Also, find $z^*$.
(A weight $i$ can represent a probability $2^{-i}$, for example.)

[Trellis diagram with stages t=1, t=2, t=3, t=4.]

(b) Using the same statistical model for the process $Z$ as in part (a), suppose there is an observation
sequence $(Y_t : 1 \le t \le 4)$ with $Y_t = Z_t + W_t$, where $W_1, W_2, W_3, W_4$ are $N(0, \sigma^2)$ random variables
with $\frac{1}{2\sigma^2} = \ln 2$. (This choice of $\sigma^2$ simplifies the problem.) Suppose $Z, W_1, W_2, W_3, W_4$ are mutually
independent. Find the MAP estimate $\widehat{Z}_{MAP}(y)$ of $(Z_1, Z_2, Z_3, Z_4)$ for the observation sequence
$y = (2, 0, 1, -2)$. Use an approach similar to part (a), by placing weights on the nodes and edges of
the same trellis so that the MAP estimate is the minimum weight path in the trellis.

5.8 Estimation of the parameter of an exponential in additive exponential noise 

Suppose an observation $Y$ has the form $Y = Z + N$, where $Z$ and $N$ are independent, $Z$ has the
exponential distribution with parameter $\theta$, $N$ has the exponential distribution with parameter one,
and $\theta > 0$ is an unknown parameter. We consider two approaches to finding $\widehat{\theta}_{ML}(y)$.

(a) Show that
\[ f_{cd}(y, z|\theta) = \begin{cases} \theta e^{-y + (1-\theta)z} & 0 \le z \le y \\ 0 & \text{else.} \end{cases} \]

(b) Find $f(y|\theta)$. The direct approach to finding $\widehat{\theta}_{ML}(y)$ is to maximize $f(y|\theta)$ (or its log) with
respect to $\theta$. You needn't attempt the maximization.

(c) Derive the EM algorithm for finding $\widehat{\theta}_{ML}(y)$. You may express your answer in terms of the
function $\phi$ defined by:

\[ \phi(y, \theta) = E[Z|y, \theta] = \begin{cases} \dfrac{1}{\theta - 1} - \dfrac{y}{\exp((\theta - 1)y) - 1} & \theta \ne 1 \\[2mm] \dfrac{y}{2} & \theta = 1. \end{cases} \]

You needn't implement the algorithm.

(d) Suppose an observation $Y = (Y_1, \ldots, Y_T)$ has the form $Y = Z + N$, where $Z = (Z_1, \ldots, Z_T)$
and $N = (N_1, \ldots, N_T)$, such that $N_1, \ldots, N_T, Z_1, \ldots, Z_T$ are mutually independent, and for each $t$,
$Z_t$ has the exponential distribution with parameter $\theta$, and $N_t$ has the exponential distribution with
parameter one, and $\theta > 0$ is an unknown parameter. Note that $\theta$ does not depend on $t$. Derive the
EM algorithm for finding $\widehat{\theta}_{ML}(y)$.

5.9 Estimation of a critical transition time of hidden state in HMM 

Consider an HMM with unobserved data $Z = (Z_1, \ldots, Z_T)$, observed data $Y = (Y_1, \ldots, Y_T)$, and
parameter vector $\theta = (\pi, A, B)$. Let $F \subset S$, where $S$ is the state space of the hidden Markov process
$Z$, and let $\tau_F$ be the first time $t$ such that $Z_t \in F$, with the convention that $\tau_F = T + 1$ if ($Z_t \notin F$
for $1 \le t \le T$).

(a) Describe how to find the conditional distribution of $\tau_F$ given $Y$, under the added assumption
that ($a_{ij} = 0$ for all $(i, j)$ such that $i \in F$ and $j \notin F$), i.e. under the assumption that $F$ is an
absorbing set for $Z$.

(b) Describe how to find the conditional distribution of $\tau_F$ given $Y$, without the added assumption
made in part (a).

5.10 Maximum likelihood estimation for HMMs 

Consider an HMM with unobserved data $Z = (Z_1, \ldots, Z_T)$, observed data $Y = (Y_1, \ldots, Y_T)$,
and parameter vector $\theta = (\pi, A, B)$. Explain how the forward-backward algorithm or the Viterbi
algorithm can be used or modified to compute the following:

(a) The ML estimator, $\widehat{Z}_{ML}$, of $Z$ based on $Y$, assuming any initial state and any transitions $i \to j$
are possible for $Z$. (Hint: Your answer should not depend on $\pi$ or $A$.)

(b) The ML estimator, $\widehat{Z}_{ML}$, of $Z$ based on $Y$, subject to the constraints that $\widehat{Z}_{ML}$ takes values in
the set $\{z : P\{Z = z\} > 0\}$. (Hint: Your answer should depend on $\pi$ and $A$ only through which
coordinates of $\pi$ and $A$ are nonzero.)

(c) The ML estimator, $\widehat{Z}_{1,ML}$, of $Z_1$ based on $Y$.

(d) The ML estimator, $\widehat{Z}_{t_o,ML}$, of $Z_{t_o}$ based on $Y$, for some fixed $t_o$ with $1 \le t_o \le T$.

5.11 An underconstrained estimation problem 

Suppose the parameter $\theta = (\pi, A, B)$ for an HMM is unknown, but that it is assumed that the
number of states $N_s$ in the state space $S$ for $(Z_t)$ is equal to the number of observations, $T$. Describe
a trivial choice of the ML estimator $\widehat{\theta}_{ML}(y)$ for a given observation sequence $y = (y_1, \ldots, y_T)$. What
is the likelihood of $y$ for this choice of $\theta$?

5.12 Specialization of Baum-Welch algorithm for no hidden data

(a) Determine how the Baum-Welch algorithm simplifies in the special case that $B$ is the identity
matrix, so that $Z_t = Y_t$ for all $t$. (b) Still assuming that $B$ is the identity matrix, suppose that
$S = \{0, 1\}$ and the observation sequence is 0001110001110001110001. Find the ML estimator for $\pi$
and $A$.

5.13 Free energy and the Boltzmann distribution 

Let $S$ denote a finite set of possible states of a physical system, and suppose the (internal) energy
of any state $s \in S$ is given by $V(s)$ for some function $V$ on $S$. Let $T > 0$. The Helmholtz free
energy of a probability distribution $Q$ on $S$ is defined to be the average (internal) energy minus the
temperature times entropy: $F(Q) = \sum_i Q(i)V(i) + T \sum_i Q(i) \log Q(i)$. Note that $F$ is a convex
function of $Q$. (We're assuming Boltzmann's constant is normalized to one, so that $T$ should
actually be in units of energy, but by abuse of notation we will call $T$ the temperature.)

(a) Use the method of Lagrange multipliers to show that the Boltzmann distribution defined by
$B_T(i) = \frac{1}{Z(T)} \exp(-V(i)/T)$ minimizes $F(Q)$. Here $Z(T)$ is the normalizing constant required to
make $B_T$ a probability distribution.

(b) Describe the limit of the Boltzmann distribution as $T \to \infty$.

(c) Describe the limit of the Boltzmann distribution as $T \to 0$. If it is possible to simulate a random
variable with the Boltzmann distribution, does this suggest an application?

(d) Show that $F(Q) = T D(Q\|B_T) +$ (term not depending on $Q$). Therefore, given an energy
function $V$ on $S$ and temperature $T > 0$, minimizing free energy over $Q$ in some set is equivalent
to minimizing the divergence $D(Q\|B_T)$ over $Q$ in the same set.

5.14 Baum-Welch saddlepoint

Suppose that the Baum-Welch algorithm is run on a given data set with initial parameter
$\theta^{(0)} = (\pi^{(0)}, A^{(0)}, B^{(0)})$ such that $\pi^{(0)} = \pi^{(0)} A^{(0)}$ (i.e., the initial distribution of the state is an equilibrium
distribution of the state) and every row of $B^{(0)}$ is identical. Explain what happens, assuming an
ideal computer with infinite precision arithmetic is used.

5.15 Inference for a mixture model 

(a) An observed random vector $Y$ is distributed as a mixture of Gaussian distributions in $d$ dimensions.
The parameter of the mixture distribution is $\theta = (\theta_1, \ldots, \theta_J)$, where $\theta_j$ is a $d$-dimensional
vector for $1 \le j \le J$. Specifically, to generate $Y$ a random variable $Z$, called the class label for
the observation, is generated. The variable $Z$ is uniformly distributed on $\{1, \ldots, J\}$, and the
conditional distribution of $Y$ given $(\theta, Z)$ is Gaussian with mean vector $\theta_Z$ and covariance the $d \times d$
identity matrix. The class label $Z$ is not observed. Assuming that $\theta$ is known, find the posterior
pmf $p(z|y, \theta)$. Give a geometrical interpretation of the MAP estimate $\widehat{Z}$ for a given observation
$Y = y$.

(b) Suppose now that the parameter $\theta$ is random with the uniform prior over a very large region
and suppose that given $\theta$, $n$ random variables are each generated as in part (a), independently, to
produce
$(Z^{(1)}, Y^{(1)}, Z^{(2)}, Y^{(2)}, \ldots, Z^{(n)}, Y^{(n)})$. Give an explicit expression for the joint distribution
$P(\theta, z^{(1)}, y^{(1)}, z^{(2)}, y^{(2)}, \ldots, z^{(n)}, y^{(n)})$.

(c) The iterative conditional modes (ICM) algorithm for this example corresponds to taking turns
maximizing $P(\theta, z^{(1)}, y^{(1)}, z^{(2)}, y^{(2)}, \ldots, z^{(n)}, y^{(n)})$ with respect to $\theta$ for $z$ fixed and with respect to $z$
for $\theta$ fixed. Give a simple geometric description of how the algorithm works and suggest a method
to initialize the algorithm (there is no unique answer for the latter).

(d) Derive the EM algorithm for this example, in an attempt to compute the maximum likelihood
estimate of $\theta$ given $y^{(1)}, y^{(2)}, \ldots, y^{(n)}$.

5.16 Constraining the Baum-Welch algorithm

The Baum-Welch algorithm as presented placed no prior assumptions on the parameters $\pi$, $A$, $B$,
other than the number of states $N_s$ in the state space of $(Z_t)$. Suppose matrices $\bar{A}$ and $\bar{B}$ are given
with the same dimensions as the matrices $A$ and $B$ to be estimated, with all elements of $\bar{A}$ and
$\bar{B}$ having values 0 and 1. Suppose that $A$ and $B$ are constrained to satisfy $A \le \bar{A}$ and $B \le \bar{B}$, in
the element-by-element ordering (for example, $a_{ij} \le \bar{a}_{ij}$ for all $i, j$.) Explain how the Baum-Welch
algorithm can be adapted to this situation.

5.17 * Implementation of algorithms 

Write a computer program to (a) simulate an HMM on a computer for a specified value of the
parameter $\theta = (\pi, A, B)$, (b) run the forward-backward algorithm and compute the $\alpha$'s, $\beta$'s, $\gamma$'s,
and $\xi$'s, and (c) run the Baum-Welch algorithm. Experiment a bit and describe your results. For
example, if $T$ observations are generated, and then the Baum-Welch algorithm is used to estimate
the parameter, how large does $T$ need to be to ensure that the estimates of $\theta$ are reasonably accurate?
Chapter 6

Dynamics of Countable-State Markov Models



Markov processes are useful for modeling a variety of dynamical systems. Often questions involving 
the long-time behavior of such systems are of interest, such as whether the process has a limiting 
distribution, or whether time-averages constructed using the process are asymptotically the same 
as statistical averages. 

6.1 Examples with finite state space 

Recall that a probability distribution $\pi$ on $S$ is an equilibrium probability distribution for a
time-homogeneous Markov process $X$ if $\pi = \pi H(t)$ for all $t$. In the discrete-time case, this condition
reduces to $\pi = \pi P$. We shall see in this section that under certain natural conditions, the existence
of an equilibrium probability distribution is related to whether the distribution of $X(t)$ converges
as $t \to \infty$. Existence of an equilibrium distribution is also connected to the mean time needed for
$X$ to return to its starting state. To motivate the conditions that will be imposed, we begin by
considering four examples of finite state processes. Then the relevant definitions are given for finite
or countably-infinite state space, and propositions regarding convergence are presented.

Example 6.1.1 Consider the discrete-time Markov process with the one-step probability diagram
shown in Figure 6.1. Note that the process can't escape from the set of states $S_1 = \{a, b, c, d, e\}$,
so that if the initial state $X(0)$ is in $S_1$ with probability one, then the limiting distribution is
supported by $S_1$. Similarly if the initial state $X(0)$ is in $S_2 = \{f, g, h\}$ with probability one, then
the limiting distribution is supported by $S_2$. Thus, the limiting distribution is not unique for this
process. The natural way to deal with this problem is to decompose the original problem into two
problems. That is, consider a Markov process on $S_1$, and then consider a Markov process on $S_2$.

Does the distribution of $X(k)$ necessarily converge if $X(0) \in S_1$ with probability one? The
answer is no. For example, note that if $X(0) = a$, then $X(k) \in \{a, c, e\}$ for all even values of $k$,
whereas $X(k) \in \{b, d\}$ for all odd values of $k$. That is, $\pi_a(k) + \pi_c(k) + \pi_e(k)$ is one if $k$ is even and
is zero if $k$ is odd. Therefore, if $\pi_a(0) = 1$, then $\pi(k)$ does not converge as $k \to \infty$.

Figure 6.1: A one-step transition probability diagram with eight states.

Basically speaking, the Markov process of Example 6.1.1 fails to have a unique limiting distribution
independent of the initial state for two reasons: (i) the process is not irreducible, and (ii)
the process is not aperiodic.



Example 6.1.2 Consider the two-state, continuous time Markov process with the transition rate
diagram shown in Figure 6.2 for some positive constants $\alpha$ and $\beta$. This was already considered in
Example 4.9.3, where we found that for any initial distribution $\pi(0)$,

\[ \lim_{t \to \infty} \pi(t) = \lim_{t \to \infty} \pi(0) H(t) = \left( \frac{\beta}{\alpha + \beta}, \frac{\alpha}{\alpha + \beta} \right). \]

The rate of convergence is exponential, with rate parameter $\alpha + \beta$, which happens to be the nonzero
eigenvalue of $Q$. Note that the limiting distribution is the unique probability distribution satisfying
$\pi Q = 0$. The periodicity problem of Example 6.1.1 does not arise for continuous-time processes.

Figure 6.2: A transition rate diagram with two states.



Example 6.1.3 Consider the continuous-time Markov process with the transition rate diagram in
Figure 6.3. The $Q$ matrix is the block-diagonal matrix given by

\[ Q = \begin{pmatrix} -\alpha & \alpha & 0 & 0 \\ \beta & -\beta & 0 & 0 \\ 0 & 0 & -\alpha & \alpha \\ 0 & 0 & \beta & -\beta \end{pmatrix}. \]

Figure 6.3: A transition rate diagram with four states.

This process is not irreducible, but rather the transition rate diagram can be decomposed into two
parts, each equivalent to the diagram for Example 6.1.2. The equilibrium probability distributions
are the probability distributions of the form
$\pi = \left( \lambda \frac{\beta}{\alpha + \beta}, \lambda \frac{\alpha}{\alpha + \beta}, (1 - \lambda) \frac{\beta}{\alpha + \beta}, (1 - \lambda) \frac{\alpha}{\alpha + \beta} \right)$, where $\lambda$
is the probability placed on the subset $\{1, 2\}$.



Example 6.1.4 Consider the discrete-time Markov process with the transition probability diagram
in Figure 6.4. The one-step transition probability matrix $P$ is given by

\[ P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}. \]

Figure 6.4: A one-step transition probability diagram with three states.

Solving the equation $\pi = \pi P$ we find there is a unique equilibrium probability vector, namely
$\pi = \left( \frac{1}{3}, \frac{1}{3}, \frac{1}{3} \right)$. On the other hand, if $\pi(0) = (1, 0, 0)$, then

\[ \pi(k) = \pi(0) P^k = \begin{cases} (1, 0, 0) & \text{if } k \equiv 0 \bmod 3 \\ (0, 1, 0) & \text{if } k \equiv 1 \bmod 3 \\ (0, 0, 1) & \text{if } k \equiv 2 \bmod 3 \end{cases} \]

Therefore, $\pi(k)$ does not converge as $k \to \infty$.
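The behavior in Example 6.1.4 is easy to reproduce numerically. The short sketch below takes $P$ to be the cyclic permutation matrix implied by the displayed values of $\pi(k)$ and iterates $\pi(k+1) = \pi(k)P$; the helper names `step` and `distributions` are hypothetical, chosen for this illustration.

```python
from typing import List

# Cyclic transition matrix consistent with pi(k) cycling through the unit vectors
P = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]


def step(pi: List[float], P: List[List[float]]) -> List[float]:
    """One step of pi(k+1) = pi(k) P for a row vector pi."""
    n = len(pi)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


def distributions(pi0: List[float], P: List[List[float]], steps: int) -> List[List[float]]:
    """Return [pi(0), pi(1), ..., pi(steps)]."""
    out = [pi0]
    for _ in range(steps):
        out.append(step(out[-1], P))
    return out


for k, pi_k in enumerate(distributions([1.0, 0.0, 0.0], P, 6)):
    print(k, pi_k)  # the distribution cycles with period 3 and never settles

print(step([1 / 3, 1 / 3, 1 / 3], P))  # the uniform vector is left unchanged by P
```

Starting from the uniform vector instead of $(1,0,0)$ gives a constant sequence, which illustrates that an equilibrium distribution can exist even when $\pi(k)$ fails to converge for other initial distributions.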



6.2 Classification and convergence of discrete-time Markov processes

The following definition applies for either discrete time or continuous time.

Definition 6.2.1 Let $X$ be a time-homogeneous Markov process on the countable state space $S$.
The process is said to be irreducible if for all $i, j \in S$, there exists $s > 0$ so that $p_{ij}(s) > 0$.
The next definition is relevant only for discrete-time processes.

Definition 6.2.2 The period of a state $i$ is defined to be $GCD\{k \ge 0 : p_{ii}(k) > 0\}$, where "GCD"
stands for greatest common divisor. The set $\{k \ge 0 : p_{ii}(k) > 0\}$ is closed under addition, which by
a result in elementary algebra^1 implies that the set contains all sufficiently large integer multiples
of the period. The Markov process is called aperiodic if the period of all the states is one.
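For a finite chain, the period of a state can be estimated directly from the definition by tracking which return times are possible. The sketch below scans return times $k = 1, \ldots, k_{max}$ and takes their GCD; the function name `period` and the default scan bound of $2n$ steps for an $n$-state chain (assumed here to be enough) are choices made for this illustration, not part of the notes.

```python
from math import gcd
from typing import List, Sequence


def period(P: Sequence[Sequence[float]], i: int, max_k: int = 0) -> int:
    """GCD of the return times k in 1..max_k with p_ii(k) > 0 (0 if there are none)."""
    n = len(P)
    max_k = max_k or 2 * n  # assumed scan bound; pass a larger value to be cautious
    # reach[j] is True when state j can be reached from i in exactly k steps
    reach: List[bool] = [j == i for j in range(n)]
    d = 0
    for k in range(1, max_k + 1):
        reach = [any(reach[u] and P[u][j] > 0 for u in range(n)) for j in range(n)]
        if reach[i]:
            d = gcd(d, k)  # gcd(0, k) == k, so the first return time initializes d
    return d
```

Only the pattern of positive entries of $P$ matters here, which is why the code works with reachability flags instead of multiplying probabilities.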

Proposition 6.2.3 If $X$ is irreducible, all states have the same period.

Proof. Let $i$ and $j$ be two states. By irreducibility, there are integers $k_1$ and $k_2$ so that $p_{ij}(k_1) > 0$
and $p_{ji}(k_2) > 0$. For any integer $n$, $p_{ii}(n + k_1 + k_2) \ge p_{ij}(k_1) p_{jj}(n) p_{ji}(k_2)$, so the set
$\{k \ge 0 : p_{ii}(k) > 0\}$ contains the set $\{k \ge 0 : p_{jj}(k) > 0\}$ translated up by $k_1 + k_2$. Thus the period of $i$ is
less than or equal to the period of $j$. Since $i$ and $j$ were arbitrary states, the proposition follows. ■

For a fixed state $i$, define $\tau_i = \min\{k \ge 1 : X(k) = i\}$, where we adopt the convention that the
minimum of an empty set of numbers is $+\infty$. Let $M_i = E[\tau_i | X(0) = i]$. If $P(\tau_i < +\infty | X(0) = i) < 1$,
state $i$ is called transient (and by convention, $M_i = +\infty$). Otherwise $P(\tau_i < +\infty | X(0) = i) = 1$,
and $i$ is said to be positive recurrent if $M_i < +\infty$ and to be null recurrent if $M_i = +\infty$.

Proposition 6.2.4 Suppose $X$ is irreducible and aperiodic.

(a) All states are transient, or all are positive recurrent, or all are null recurrent.

(b) For any initial distribution $\pi(0)$, $\lim_{t \to \infty} \pi_i(t) = 1/M_i$, with the understanding that the limit
is zero if $M_i = +\infty$.

(c) An equilibrium probability distribution $\pi$ exists if and only if all states are positive recurrent.

(d) If it exists, the equilibrium probability distribution $\pi$ is given by $\pi_i = 1/M_i$. (In particular, if
it exists, the equilibrium probability distribution is unique).

Proof. (a) Suppose state $i$ is recurrent. Given $X(0) = i$, after leaving $i$ the process returns to
state $i$ at time $\tau_i$. The process during the time interval $\{0, \ldots, \tau_i\}$ is the first excursion of $X$ from
state $i$. From time $\tau_i$ onward, the process behaves just as it did initially. Thus there is a second
excursion from $i$, third excursion from $i$, and so on. Let $T_k$ for $k \ge 1$ denote the length of the $k$th
excursion. Then the $T_k$'s are independent, and each has the same distribution as $T_1 = \tau_i$. Let $j$ be
another state and let $\epsilon$ denote the probability that $X$ visits state $j$ during one excursion from $i$.
Since $X$ is irreducible, $\epsilon > 0$. The excursions are independent, so state $j$ is visited during the $k$th
excursion with probability $\epsilon$, independently of whether $j$ was visited in earlier excursions. Thus, the
number of excursions needed until state $j$ is reached has the geometric distribution with parameter
$\epsilon$, which has mean $1/\epsilon$. In particular, state $j$ is eventually visited with probability one. After $j$
is visited the process eventually returns to state $i$, and then within an average of $1/\epsilon$ additional
excursions, it will return to state $j$ again. Thus, state $j$ is also recurrent. Hence, if one state is
recurrent, all states are recurrent.



1 Such as the Euclidean algorithm, Chinese remainder theorem, or Bezout theorem 
The same argument shows that if $i$ is positive recurrent, then $j$ is positive recurrent. Given
$X(0) = i$, the mean time needed for the process to visit $j$ and then return to $i$ is $M_i/\epsilon$, since on
average $1/\epsilon$ excursions of mean length $M_i$ are needed. Thus, the mean time to hit $j$ starting from
$i$, and the mean time to hit $i$ starting from $j$, are both finite. Thus, $j$ is positive recurrent. Hence,
if one state is positive recurrent, all states are positive recurrent.

(b) Part (b) of the proposition follows by an application of the renewal theorem, which can be
found in [1].

(c) Suppose all states are positive recurrent. By the law of large numbers, for any state $j$, the
long run fraction of time the process is in state $j$ is $1/M_j$ with probability one. Similarly, for any
states $i$ and $j$, the long run fraction of time the process is in state $j$ is $\gamma_{ij}/M_i$, where $\gamma_{ij}$ is the
mean number of visits to $j$ in an excursion from $i$. Therefore $1/M_j = \gamma_{ij}/M_i$. This implies that
$\sum_i 1/M_i = 1$. That is, $\pi$ defined by $\pi_i = 1/M_i$ is a probability distribution. The convergence for
each $i$ separately given in part (b), together with the fact that $\pi$ is a probability distribution, imply
that $\sum_i |\pi_i(t) - \pi_i| \to 0$. Thus, taking $s$ to infinity in the equation $\pi(s)H(t) = \pi(s + t)$ yields
$\pi H(t) = \pi$, so that $\pi$ is an equilibrium probability distribution.

Conversely, if there is an equilibrium probability distribution $\pi$, consider running the process
with initial distribution $\pi$. Then $\pi(t) = \pi$ for all $t$. So by part (b), for any state $i$, $\pi_i = 1/M_i$. Taking a
state $i$ such that $\pi_i > 0$, it follows that $M_i < \infty$. So state $i$ is positive recurrent. By part (a), all
states are positive recurrent.

(d) Part (d) was proved in the course of proving part (c). ■
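The identity $\pi_i = 1/M_i$ in part (d) can be checked numerically on a small chain. The sketch below computes $M_i$ by iterating the first-step equations for the expected hitting times of state $i$; the function name `mean_return_time`, the fixed iteration count, and the three-state example matrix are assumptions made for illustration only.

```python
from typing import List, Sequence


def mean_return_time(P: Sequence[Sequence[float]], i: int, iters: int = 10_000) -> float:
    """Approximate M_i = E[tau_i | X(0) = i] for a finite irreducible chain."""
    n = len(P)
    # h[j] approximates the expected time to hit i starting from j (h[i] is kept at 0)
    h: List[float] = [0.0] * n
    for _ in range(iters):
        # First-step equations: h_j = 1 + sum over k != i of p_jk * h_k
        h = [0.0 if j == i else
             1.0 + sum(P[j][k] * h[k] for k in range(n) if k != i)
             for j in range(n)]
    # One step out of i, then the expected time to come back
    return 1.0 + sum(P[i][k] * h[k] for k in range(n) if k != i)


# Hypothetical example: irreducible, aperiodic, with equilibrium (1/4, 1/2, 1/4)
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
print([mean_return_time(P, i) for i in range(3)])  # approximately [4, 2, 4]
```

The fixed-point iteration converges geometrically for a finite irreducible chain; solving the same linear equations exactly would be the more careful choice for larger state spaces.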

We conclude this section by describing a technique to establish a rate of convergence to the
equilibrium distribution for finite-state Markov processes. Define $\delta(P)$ for a one-step transition
probability matrix $P$ by

\[ \delta(P) = \min_{i,k} \sum_j p_{ij} \wedge p_{kj}, \]

where $a \wedge b = \min\{a, b\}$. The number $\delta(P)$ is known as Dobrushin's coefficient of ergodicity. Since
$a + b - 2(a \wedge b) = |a - b|$ for $a, b \ge 0$, we also have

\[ 1 - \delta(P) = \frac{1}{2} \max_{i,k} \sum_j |p_{ij} - p_{kj}|. \]

Let $\|\mu\|_1$ for a vector $\mu$ denote the $L_1$ norm: $\|\mu\|_1 = \sum_i |\mu_i|$.

Proposition 6.2.5 For any probability vectors $\pi$ and $\sigma$, $\|\pi P - \sigma P\|_1 \le (1 - \delta(P)) \|\pi - \sigma\|_1$.
Furthermore, if $\delta(P) > 0$ there is a unique equilibrium distribution $\pi^\infty$, and for any other probability
distribution $\pi$ on $S$, $\|\pi P^l - \pi^\infty\|_1 \le 2(1 - \delta(P))^l$.

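Proof. Let $\tilde{\pi}_i = \pi_i - \pi_i \wedge \sigma_i$ and $\tilde{\sigma}_i = \sigma_i - \pi_i \wedge \sigma_i$. Note that if $\pi_i \ge \sigma_i$ then $\tilde{\pi}_i = \pi_i - \sigma_i$
and $\tilde{\sigma}_i = 0$, and if $\pi_i \le \sigma_i$ then $\tilde{\sigma}_i = \sigma_i - \pi_i$ and $\tilde{\pi}_i = 0$. Also, $\|\tilde{\pi}\|_1$ and $\|\tilde{\sigma}\|_1$ are both equal to
$1 - \sum_i \pi_i \wedge \sigma_i$. Therefore, $\|\pi - \sigma\|_1 = \|\tilde{\pi} - \tilde{\sigma}\|_1 = 2\|\tilde{\pi}\|_1 = 2\|\tilde{\sigma}\|_1$. Furthermore,

\begin{align*}
\|\pi P - \sigma P\|_1 &= \|\tilde{\pi} P - \tilde{\sigma} P\|_1 \\
&= \sum_j \Big| \sum_i \tilde{\pi}_i p_{ij} - \sum_k \tilde{\sigma}_k p_{kj} \Big| \\
&= (1/\|\tilde{\pi}\|_1) \sum_j \Big| \sum_{i,k} \tilde{\pi}_i \tilde{\sigma}_k (p_{ij} - p_{kj}) \Big| \\
&\le (1/\|\tilde{\pi}\|_1) \sum_{i,k} \tilde{\pi}_i \tilde{\sigma}_k \sum_j |p_{ij} - p_{kj}| \\
&\le \|\tilde{\pi}\|_1 (2 - 2\delta(P)) = \|\pi - \sigma\|_1 (1 - \delta(P)),
\end{align*}

which proves the first part of the proposition. Iterating the inequality just proved yields that

\[ \|\pi P^l - \sigma P^l\|_1 \le (1 - \delta(P))^l \|\pi - \sigma\|_1 \le 2(1 - \delta(P))^l. \tag{6.1} \]

This inequality for $\sigma = \pi P^n$ yields that $\|\pi P^l - \pi P^{l+n}\|_1 \le 2(1 - \delta(P))^l$. Thus the sequence $\pi P^l$
is a Cauchy sequence and has a limit $\pi^\infty$, and $\pi^\infty P = \pi^\infty$. Finally, taking $\sigma$ in (6.1) equal to $\pi^\infty$
yields the last part of the proposition. ■

Proposition 6.2.5 typically does not yield the exact asymptotic rate that $\|\pi P^l - \pi^\infty\|_1$ tends to
zero. The asymptotic behavior can be investigated by computing $(I - zP)^{-1}$, and matching powers
of $z$ in the identity $(I - zP)^{-1} = \sum_{n=0}^{\infty} z^n P^n$.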
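Dobrushin's coefficient and the bound of Proposition 6.2.5 are simple to evaluate for a small matrix. The following sketch uses a hypothetical three-state matrix with $\delta(P) = 1/2$ and equilibrium $(1/4, 1/2, 1/4)$; the function names `dobrushin`, `l1`, and `step` are illustrative choices, not notation from the notes.

```python
from typing import List, Sequence


def dobrushin(P: Sequence[Sequence[float]]) -> float:
    """delta(P): minimum over row pairs (i, k) of sum_j min(p_ij, p_kj)."""
    n = len(P)
    return min(sum(min(a, b) for a, b in zip(P[i], P[k]))
               for i in range(n) for k in range(n))


def l1(u: Sequence[float], v: Sequence[float]) -> float:
    """L1 distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


def step(pi: Sequence[float], P: Sequence[Sequence[float]]) -> List[float]:
    """One step of the row-vector update pi -> pi P."""
    n = len(pi)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical example matrix; its equilibrium distribution is (1/4, 1/2, 1/4)
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
pi_inf = [0.25, 0.5, 0.25]
d = dobrushin(P)

pi = [1.0, 0.0, 0.0]
for l in range(1, 6):
    pi = step(pi, P)
    # Compare the actual distance with the bound 2 * (1 - delta(P))^l
    print(l, l1(pi, pi_inf), 2 * (1 - d) ** l)
```

In this example the actual distance happens to decay at the same geometric rate as the bound, only with a smaller constant; in general the bound can be loose in the rate as well.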

6.3 Classification and convergence of continuous-time Markov processes

Chapter 4 discusses Markov processes in continuous time with a finite number of states. Here
we extend the coverage of continuous-time Markov processes to include countably infinitely many
states. For example, the state of a simple queue could be the number of customers in the queue,
and if there is no upper bound on the number of customers that can be waiting in the queue, the
state space is $\mathbb{Z}_+$. One possible complication, that rarely arises in practice, is that a continuous
time process can make infinitely many jumps in a finite amount of time.

Let $S$ be a finite or countably infinite set, and let $\triangle \notin S$. A pure-jump function is a function
$x : \mathbb{R}_+ \to S \cup \{\triangle\}$ such that there is a sequence of times, $0 = \tau_0 < \tau_1 < \ldots$, and a sequence of
states, $s_0, s_1, \ldots$ with $s_i \in S$, and $s_i \ne s_{i+1}$, $i \ge 0$, so that

\[ x(t) = \begin{cases} s_i & \text{if } \tau_i \le t < \tau_{i+1},\ i \ge 0 \\ \triangle & \text{if } t \ge \tau^* \end{cases} \tag{6.2} \]

where $\tau^* = \lim_{i \to \infty} \tau_i$. If $\tau^*$ is finite it is said to be the explosion time of the function $x$, and if
$\tau^* = +\infty$ the function is said to be nonexplosive. The example corresponding to $S = \{0, 1, \ldots\}$,
$\tau_i = i/(i + 1)$ and $s_i = i$ is pictured in Fig. 6.5. Note that $\tau^* = 1$ for this example.
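The example with $\tau_i = i/(i+1)$ and $s_i = i$ can be written out explicitly, since $\tau_i \le t < \tau_{i+1}$ is equivalent to $i \le t/(1-t) < i+1$ for $0 \le t < 1$. The sketch below is only an illustration; the string constant `DELTA` standing in for the extra state $\triangle$ is an assumption of this example.

```python
from typing import Union

DELTA = "Delta"  # stands in for the extra state outside S


def x(t: float) -> Union[int, str]:
    """Pure-jump function with tau_i = i/(i+1) and s_i = i; explosion time tau* = 1."""
    if t < 0:
        raise ValueError("t must be nonnegative")
    if t >= 1.0:
        return DELTA  # at or after the explosion time
    # tau_i <= t < tau_{i+1} holds exactly when i = floor(t / (1 - t));
    # floating-point rounding can be off by one exactly at a jump time
    return int(t / (1.0 - t))


for t in (0.0, 0.25, 0.5, 0.6, 0.75, 1.0):
    print(t, x(t))
```

The number of jumps before time $t$ grows without bound as $t$ increases to one, which is what the explosion at $\tau^* = 1$ means.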
Figure 6.5: A pure-jump function with an explosion time.

Definition 6.3.1 A pure-jump Markov process $(X_t : t \ge 0)$ is a Markov process such that, with
probability one, its sample paths are pure-jump functions. Such a process is said to be nonexplosive
if its sample paths are nonexplosive, with probability one.

Generator matrices are defined for countable-state Markov processes just as they are for
finite-state Markov processes. A pure-jump, time-homogeneous Markov process $X$ has generator matrix
$Q = (q_{ij} : i, j \in S)$ if

\[ \lim_{h \searrow 0} (p_{ij}(h) - I_{\{i=j\}})/h = q_{ij} \qquad i, j \in S \tag{6.3} \]

or equivalently

\[ p_{ij}(h) = I_{\{i=j\}} + h q_{ij} + o(h) \qquad i, j \in S \tag{6.4} \]

where $o(h)$ represents a quantity such that $\lim_{h \to 0} o(h)/h = 0$.

The space-time properties for continuous-time Markov processes with a countably infinite num- 
ber of states are the same as for a finite number of states. There is a discrete-time jump process, 
and the holding times, given the jump process, are exponentially distributed. Also, the following 
holds. 



Proposition 6.3.2 Given a matrix $Q = (q_{ij} : i, j \in S)$ satisfying $q_{ij} \ge 0$ for distinct states $i$ and $j$,
and $q_{ii} = -\sum_{j \in S, j \ne i} q_{ij}$ for each state $i$, and a probability distribution $\pi(0) = (\pi_i(0) : i \in S)$, there
is a pure-jump, time-homogeneous Markov process with generator matrix $Q$ and initial distribution
$\pi(0)$. The finite-dimensional distributions of the process are uniquely determined by $\pi(0)$ and
$Q$. The Chapman-Kolmogorov equations, $H(s, t) = H(s, \tau) H(\tau, t)$, and the Kolmogorov forward
equations, $\frac{\partial \pi_j(t)}{\partial t} = \sum_{i \in S} \pi_i(t) q_{ij}$, hold.



Example 6.3.3 (Birth-death processes) A useful class of countable-state Markov processes is the set of birth-death processes. A (continuous-time) birth-death process with parameters $(\lambda_0, \lambda_1, \ldots)$ and $(\mu_1, \mu_2, \ldots)$ (also set $\lambda_{-1} = \mu_0 = 0$) is a pure-jump Markov process with state space $S = \mathbb{Z}_+$ and generator matrix $Q$ defined by $q_{k,k+1} = \lambda_k$, $q_{kk} = -(\mu_k + \lambda_k)$, and $q_{k,k-1} = \mu_k$ for $k \geq 0$, and $q_{ij} = 0$ if $|i - j| \geq 2$. The transition rate diagram is shown in Fig. 6.6. The space-time structure, as defined in Section 4.10, of such a process is as follows. Given the process is in state $k$ at time $t$, the next state visited is $k+1$ with probability $\lambda_k/(\lambda_k + \mu_k)$ and $k-1$ with probability $\mu_k/(\lambda_k + \mu_k)$. The holding time of state $k$ is exponential with parameter $\lambda_k + \mu_k$. The Kolmogorov forward equations for birth-death processes are

$$\frac{\partial \pi_k(t)}{\partial t} = \lambda_{k-1}\pi_{k-1}(t) - (\lambda_k + \mu_k)\pi_k(t) + \mu_{k+1}\pi_{k+1}(t) \tag{6.5}$$
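The space-time structure just described translates directly into a simulation recipe: draw an exponential holding time with parameter $\lambda_k + \mu_k$, then move up with probability $\lambda_k/(\lambda_k + \mu_k)$. The following Python sketch does this under stated assumptions: the function name, the callables `birth_rate` and `death_rate`, and the constant rates in the usage lines are all hypothetical.

```python
import random

def simulate_birth_death(birth_rate, death_rate, start, t_end, rng):
    """Return the list of (jump time, new state) pairs on [0, t_end),
    beginning with (0.0, start). birth_rate(k) and death_rate(k) play the
    roles of lambda_k and mu_k; mu_0 is treated as zero."""
    t, k = 0.0, start
    path = [(t, k)]
    while True:
        lam = birth_rate(k)
        mu = death_rate(k) if k > 0 else 0.0
        total = lam + mu
        if total <= 0.0:                 # absorbing state: no further jumps
            break
        t += rng.expovariate(total)      # holding time ~ Exp(lam + mu)
        if t >= t_end:
            break
        # the next state is k + 1 with probability lam / (lam + mu)
        k = k + 1 if rng.random() < lam / total else k - 1
        path.append((t, k))
    return path

rng = random.Random(0)                   # fixed seed, for repeatability only
path = simulate_birth_death(lambda k: 1.0, lambda k: 2.0, 0, 50.0, rng)
print(len(path), path[-1])
```

The loop stops at `t_end` rather than after a fixed number of jumps, so for bounded rates it always terminates; for rates that grow quickly an explosive process could generate very many jumps before `t_end`.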



Example 6.3.4 (Description of a Poisson process as a Markov process) Let $\lambda > 0$ and consider a
birth-death process $N$ with $\lambda_k = \lambda$ and $\mu_k = 0$ for all $k$, with initial state zero with probability one.
The space-time structure of this Markov process is rather simple. Each transition is an upward
jump of size one, so the jump process is deterministic: $N^J(k) = k$ for all $k$. Ordinarily, the holding
times are only conditionally independent given the jump process, but since the jump process is
deterministic, the holding times are independent. Also, since $q_{kk} = -\lambda$ for all $k$, each holding time
is exponentially distributed with parameter $\lambda$. Therefore, $N$ satisfies condition (b) of Proposition
4.5.2, so that $N$ is a Poisson process with rate $\lambda$.

Define, for $i \in S$, $\tau_i^o = \min\{t > 0 : X(t) \neq i\}$, and $\tau_i = \min\{t > \tau_i^o : X(t) = i\}$. Thus, if
$X(0) = i$, $\tau_i$ is the first time the process returns to state $i$, with the exception that $\tau_i = +\infty$ if the
process never returns to state $i$. The following definitions are the same as when $X$ is a discrete-time
process. Let $M_i = E[\tau_i \mid X(0) = i]$. If $P\{\tau_i < +\infty\} < 1$, state $i$ is called transient. Otherwise
$P\{\tau_i < +\infty\} = 1$, and $i$ is said to be positive recurrent if $M_i < +\infty$ and to be null recurrent if
$M_i = +\infty$. The following propositions are analogous to those for discrete-time Markov processes.
Proofs can be found in [1, 10].

Proposition 6.3.5 Suppose $X$ is irreducible.

(a) All states are transient, or all are positive recurrent, or all are null recurrent.

(b) For any initial distribution $\pi(0)$, $\lim_{t\to+\infty} \pi_i(t) = 1/(-q_{ii}M_i)$, with the understanding that the limit is zero if $M_i = +\infty$.

Figure 6.6: Transition rate diagram of a birth-death process.

Proposition 6.3.6 Suppose $X$ is irreducible and nonexplosive.

(a) A probability distribution $\pi$ is an equilibrium distribution if and only if $\pi Q = 0$.

(b) An equilibrium probability distribution exists if and only if all states are positive recurrent.

(c) If all states are positive recurrent, the equilibrium probability distribution is given by $\pi_i = 1/(-q_{ii}M_i)$. (In particular, if it exists, the equilibrium probability distribution is unique.)

The assumption that X be nonexplosive is needed for Proposition 6.3.6(a) (per Problem 6.9), 
but the following proposition shows that the Markov processes encountered in most applications 
are nonexplosive. 

Proposition 6.3.7 Suppose $X$ is irreducible. Fix a state $i_o$ and for $k \geq 1$ let $S_k$ denote the set of
states reachable from $i_o$ in $k$ jumps. Suppose for each $k \geq 1$ there is a constant $\gamma_k$ so that the jump
intensities on $S_k$ are bounded by $\gamma_k$, that is, suppose $-q_{ii} \leq \gamma_k$ for $i \in S_k$. If $\sum_{k=1}^{\infty} \frac{1}{\gamma_k} = +\infty$, then the
process $X$ is nonexplosive.

6.4 Classification of birth-death processes 

The classification of birth-death processes, introduced in Example 6.3.3, is relatively simple. To
avoid trivialities, consider a birth-death process such that the birth rates, $(\lambda_i : i \geq 0)$, and death
rates, $(\mu_i : i \geq 1)$, are all strictly positive. Then the process is irreducible.

First, investigate whether the process is nonexplosive, because this is a necessary condition for 
both recurrence and positive recurrence. This is usually a simple matter, because if the rates are 
bounded or grow at most linearly, the process is nonexplosive by Proposition 6.3.7. In some cases, 
even if Proposition 6.3.7 doesn't apply, it can be shown by some other means that the process 
is nonexplosive. For example, a test is given below for the process to be recurrent, and if it is 
recurrent, it is not explosive. 

Next, investigate whether $X$ is positive recurrent. Suppose we already know that the process
is nonexplosive. Then the process is positive recurrent if and only if $\pi Q = 0$ for some probability
distribution $\pi$, and if it is positive recurrent, $\pi$ is the equilibrium distribution. Now $\pi Q = 0$ if and
only if flow balance holds for any state $k$:

$$(\lambda_k + \mu_k)\pi_k = \lambda_{k-1}\pi_{k-1} + \mu_{k+1}\pi_{k+1}. \tag{6.6}$$

Equivalently, flow balance must hold for all sets of the form $\{0, \ldots, n-1\}$ (just sum each side
of (6.6) over $k \in \{0, \ldots, n-1\}$, recalling that $\lambda_{-1} = \mu_0 = 0$). Therefore, $\pi Q = 0$ if and only if $\pi_{n-1}\lambda_{n-1} = \pi_n\mu_n$ for $n \geq 1$,
which holds if and only if $\pi_n = \pi_0\lambda_0 \cdots \lambda_{n-1}/(\mu_1 \cdots \mu_n)$
for $n \geq 1$. Thus, a probability distribution $\pi$ with $\pi Q = 0$ exists if and only if $S_1 < +\infty$, where

$$S_1 = \sum_{i=0}^{\infty} \frac{\lambda_0 \cdots \lambda_{i-1}}{\mu_1 \cdots \mu_i}, \tag{6.7}$$

with the understanding that the $i = 0$ term in the sum defining $S_1$ is one. Thus, under the
assumption that $X$ is nonexplosive, $X$ is positive recurrent if and only if $S_1 < \infty$, and if $X$ is
positive recurrent, the equilibrium distribution is given by $\pi_n = (\lambda_0 \cdots \lambda_{n-1})/(S_1\mu_1 \cdots \mu_n)$.




Finally, investigate whether $X$ is recurrent. This step is not necessary if we already know that
$X$ is positive recurrent, because a positive recurrent process is recurrent. The following test for
recurrence is valid whether or not $X$ is explosive. Since all states have the same classification,
the process is recurrent if and only if state 0 is recurrent. Thus, the process is recurrent if the
probability the process never hits 0, for initial state 1, is zero. We shall first find the probability
of never hitting state zero for a modified process, which stops upon reaching a large state $n$, and
then let $n \to \infty$ to find the probability the original process never hits state 0. Let $b_{in}$ denote the
probability, for initial state $i$, the process does not reach zero before reaching $n$. Set the boundary
conditions, $b_{0n} = 0$ and $b_{nn} = 1$. Fix $i$ with $1 \leq i \leq n-1$, and derive an expression for $b_{in}$ by first
conditioning on the state reached by the first jump of the process, starting from state $i$. By the
space-time structure, the probability the first jump is up is $\lambda_i/(\lambda_i + \mu_i)$ and the probability the
first jump is down is $\mu_i/(\lambda_i + \mu_i)$. Thus,

$$b_{in} = \frac{\lambda_i}{\lambda_i + \mu_i} b_{i+1,n} + \frac{\mu_i}{\lambda_i + \mu_i} b_{i-1,n},$$

which can be rewritten as $\mu_i(b_{in} - b_{i-1,n}) = \lambda_i(b_{i+1,n} - b_{i,n})$. In particular, $b_{2n} - b_{1n} = b_{1n}\mu_1/\lambda_1$
and $b_{3n} - b_{2n} = b_{1n}\mu_1\mu_2/(\lambda_1\lambda_2)$, and so on, which upon summing yields the expression

$$b_{kn} = b_{1n} \sum_{i=0}^{k-1} \frac{\mu_1\mu_2 \cdots \mu_i}{\lambda_1\lambda_2 \cdots \lambda_i},$$

with the convention that the $i = 0$ term in the sum is one. Finally, the condition $b_{nn} = 1$ yields
the solution

$$b_{1n} = \frac{1}{\sum_{i=0}^{n-1} \frac{\mu_1\mu_2 \cdots \mu_i}{\lambda_1\lambda_2 \cdots \lambda_i}}. \tag{6.8}$$
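The formula (6.8) can be checked numerically against the first-step equations from which it was derived. The Python sketch below computes $b_{kn}$ for $0 \leq k \leq n$ from the summed expression; the function name and the constant rates in the usage lines are hypothetical.

```python
def hitting_probs(lam, mu, n):
    """Return [b_0n, ..., b_nn], where b_kn is the probability of reaching n
    before 0 from state k. lam(i) and mu(i) are the rates lambda_i and mu_i."""
    # prefix[k] = sum_{i=0}^{k-1} mu_1...mu_i / (lambda_1...lambda_i)
    prefix = [0.0]
    term = 1.0                           # the i = 0 term is one
    for i in range(n):
        prefix.append(prefix[-1] + term)
        term *= mu(i + 1) / lam(i + 1)
    b1 = 1.0 / prefix[n]                 # equation (6.8)
    return [b1 * prefix[k] for k in range(n + 1)]

b = hitting_probs(lambda i: 1.0, lambda i: 2.0, 5)
print(b[1])                              # 1/31 for these rates
```

With equal birth and death rates the same function returns $b_{1n} = 1/n$, the familiar gambler's ruin value.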

Note that $b_{1n}$ is the probability, for initial state 1, of the event $B_n$ that state $n$ is reached without
an earlier visit to state 0. Since $B_{n+1} \subset B_n$ for all $n \geq 1$,

$$P(\cap_{n \geq 1} B_n \mid X(0) = 1) = \lim_{n\to\infty} b_{1n} = 1/S_2 \tag{6.9}$$

where

$$S_2 = \sum_{i=0}^{\infty} \frac{\mu_1\mu_2 \cdots \mu_i}{\lambda_1\lambda_2 \cdots \lambda_i},$$

with the understanding that the $i = 0$ term in the sum defining $S_2$ is one. Due to the definition of
pure jump processes used, whenever $X$ visits a state in $S$ the number of jumps up until that time
is finite. Thus, on the event $\cap_{n \geq 1} B_n$, state zero is never reached. Conversely, if state zero is never
reached, either the process remains bounded (which has probability zero) or $\cap_{n \geq 1} B_n$ is true. Thus,
$P(\text{zero is never reached} \mid X(0) = 1) = 1/S_2$. Consequently, $X$ is recurrent if and only if $S_2 = \infty$.
In summary, the following proposition is proved.

Proposition 6.4.1 Suppose $X$ is a continuous-time birth-death process with strictly positive birth
rates and death rates. If $X$ is nonexplosive (for example, if the rates are bounded or grow at most
linearly with $n$, or if $S_2 = \infty$) then $X$ is positive recurrent if and only if $S_1 < +\infty$. If $X$ is positive
recurrent the equilibrium probability distribution is given by $\pi_n = (\lambda_0 \cdots \lambda_{n-1})/(S_1\mu_1 \cdots \mu_n)$.
The process $X$ is recurrent if and only if $S_2 = \infty$.
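To get a feel for the two sums, the following Python sketch accumulates the first terms of $S_1$ and $S_2$ and normalizes the $S_1$ terms into an approximate equilibrium distribution. It is only a numerical aid: a finite partial sum cannot prove convergence or divergence. The function name, the truncation level, and the constant rates are assumptions made for the illustration.

```python
def birth_death_sums(lam, mu, n_terms):
    """Return (terms of S1, partial sum of S2) using the first n_terms terms.
    lam(i) and mu(i) are the birth and death rates lambda_i and mu_i."""
    s1_terms, s2 = [], 0.0
    t1 = t2 = 1.0                        # the i = 0 terms are one
    for i in range(n_terms):
        s1_terms.append(t1)
        s2 += t2
        t1 *= lam(i) / mu(i + 1)         # next term of S1
        t2 *= mu(i + 1) / lam(i + 1)     # next term of S2
    return s1_terms, s2

# Hypothetical constant rates lambda = 1, mu = 2.
terms, s2 = birth_death_sums(lambda k: 1.0, lambda k: 2.0, 60)
s1 = sum(terms)
pi = [t / s1 for t in terms]             # approximates pi_n = term_n / S1
print(s1, s2, pi[:3])
```

Here the partial sum of $S_1$ settles near 2 while that of $S_2$ grows geometrically, which is consistent with a positive recurrent process.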

Discrete-time birth-death processes have a similar characterization. They are discrete-time,
time-homogeneous Markov processes with state space equal to the set of nonnegative integers. Let
nonnegative birth probabilities $(\lambda_k : k \geq 0)$ and death probabilities $(\mu_k : k \geq 1)$ satisfy $\lambda_0 \leq 1$, and
$\lambda_k + \mu_k \leq 1$ for $k \geq 1$. The one-step transition probability matrix $P = (p_{ij} : i,j \geq 0)$ is given by

$$p_{ij} = \begin{cases} \lambda_i & \text{if } j = i+1 \\ \mu_i & \text{if } j = i-1 \\ 1 - \lambda_i - \mu_i & \text{if } j = i \geq 1 \\ 1 - \lambda_0 & \text{if } j = i = 0 \\ 0 & \text{else.} \end{cases} \tag{6.10}$$

Implicit in the specification of $P$ is that births and deaths can't happen simultaneously. If the birth
and death probabilities are strictly positive, Proposition 6.4.1 holds as before, with the exception
that the discrete-time process cannot be explosive.²

6.5 Time averages vs. statistical averages 

Let $X$ be a positive recurrent, irreducible, time-homogeneous Markov process with equilibrium
probability distribution $\pi$. To be definite, suppose $X$ is a continuous-time process, with pure-jump
sample paths and generator matrix $Q$. The results of this section apply with minor modifications
to the discrete-time setting as well. Above it is noted that $\lim_{t\to\infty} \pi_i(t) = \pi_i = 1/(-q_{ii}M_i)$, where
$M_i$ is the mean "cycle time" of state $i$. A related consideration is convergence of the empirical
distribution of the Markov process, where the empirical distribution is the distribution observed
over a (usually large) time interval.

For a fixed state $i$, the fraction of time the process spends in state $i$ during $[0, t]$ is

$$\frac{1}{t}\int_0^t I_{\{X(s)=i\}}\,ds.$$

Let $T_0$ denote the time that the process is first in state $i$, and let $T_k$ for $k \geq 1$ denote the time
that the process jumps to state $i$ for the $k$th time after $T_0$. The cycle times $T_{k+1} - T_k$, $k \geq 0$, are
independent and identically distributed, with mean $M_i$. Therefore, by the law of large numbers,
with probability one,

$$\lim_{k\to\infty} \frac{T_k}{kM_i} = \lim_{k\to\infty} \frac{1}{kM_i}\left(T_0 + \sum_{l=0}^{k-1}(T_{l+1} - T_l)\right) = 1.$$

² If in addition $\lambda_i + \mu_i = 1$ for all $i$, the discrete-time process has period 2.

Furthermore, during the $k$th cycle interval $[T_k, T_{k+1})$, the amount of time spent by the process
in state $i$ is exponentially distributed with mean $-1/q_{ii}$, and the time spent in the state during
disjoint cycles is independent. Thus, with probability one,

$$\lim_{k\to\infty} \frac{1}{kM_i}\int_0^{T_k} I_{\{X(s)=i\}}\,ds = \lim_{k\to\infty} \frac{1}{kM_i}\sum_{l=0}^{k-1}\int_{T_l}^{T_{l+1}} I_{\{X(s)=i\}}\,ds = \frac{1}{M_i}E\left[\int_{T_0}^{T_1} I_{\{X(s)=i\}}\,ds\right] = 1/(-q_{ii}M_i).$$

Combining these two observations yields that

$$\lim_{t\to\infty} \frac{1}{t}\int_0^t I_{\{X(s)=i\}}\,ds = 1/(-q_{ii}M_i) = \pi_i \tag{6.11}$$

with probability one. In short, the limit (6.11) is expected, because the process spends on average
$-1/q_{ii}$ time units in state $i$ per cycle from state $i$, and the cycle rate is $1/M_i$. Of course, since state
$i$ is arbitrary, if $j$ is any other state

$$\lim_{t\to\infty} \frac{1}{t}\int_0^t I_{\{X(s)=j\}}\,ds = 1/(-q_{jj}M_j) = \pi_j \tag{6.12}$$
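The convergence of time fractions in (6.11) and (6.12) can be observed in a small simulation. The Python sketch below uses a two-state chain that is not part of the text, with assumed rates a (from 0 to 1) and b (from 1 to 0). For that chain $\pi = (b, a)/(a+b)$, which agrees with $1/(-q_{ii}M_i)$ because every cycle has mean length $1/a + 1/b$. The function name, seed, and run length are arbitrary choices.

```python
import random

def occupation_fractions(a, b, n_jumps, rng):
    """Fractions of time spent in states 0 and 1 by the two-state chain
    with rates a (0 -> 1) and b (1 -> 0), over n_jumps holding times."""
    time_in = [0.0, 0.0]
    state = 0
    for _ in range(n_jumps):
        rate = a if state == 0 else b
        time_in[state] += rng.expovariate(rate)   # exponential holding time
        state = 1 - state                          # the only possible jump
    total = time_in[0] + time_in[1]
    return [time_in[0] / total, time_in[1] / total]

a, b = 1.0, 2.0                                    # hypothetical rates
pi = [b / (a + b), a / (a + b)]                    # solves pi Q = 0
frac = occupation_fractions(a, b, 200_000, random.Random(1))
print(pi, frac)
```

The empirical fractions should be close to $\pi$ for a long run, with fluctuations shrinking roughly like the inverse square root of the number of cycles.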



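By considering how the time in state $j$ is distributed among the cycles from state $i$, it follows that
the mean time spent in state $j$ per cycle from state $i$ is $M_i\pi_j$.
So for any nonnegative function $\phi$ on $S$,

$$\begin{aligned}
\lim_{t\to\infty} \frac{1}{t}\int_0^t \phi(X(s))\,ds &= \lim_{k\to\infty} \frac{1}{kM_i}\int_0^{T_k} \phi(X(s))\,ds \\
&= \frac{1}{M_i}E\left[\int_{T_0}^{T_1} \phi(X(s))\,ds\right] \\
&= \frac{1}{M_i}E\left[\sum_{j \in S} \phi(j)\int_{T_0}^{T_1} I_{\{X(s)=j\}}\,ds\right] \\
&= \frac{1}{M_i}\sum_{j \in S} \phi(j)E\left[\int_{T_0}^{T_1} I_{\{X(s)=j\}}\,ds\right] \\
&= \sum_{j \in S} \phi(j)\pi_j
\end{aligned} \tag{6.13}$$

Finally, if $\phi$ is a function on $S$ such that either $\sum_{j \in S} \phi_+(j)\pi_j < \infty$ or $\sum_{j \in S} \phi_-(j)\pi_j < \infty$, then
since (6.13) holds for both $\phi_+$ and $\phi_-$, it must hold for $\phi$ itself.

Figure 6.7: A single server queueing system.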



6.6 Queueing systems, M/M/1 queue and Little's law

Some basic terminology of queueing theory will now be explained. A simple type of queueing system 
is pictured in Figure 6.7. Notice that the system is comprised of a queue and a server. Ordinarily 
whenever the system is not empty, there is a customer in the server, and any other customers in 
the system are waiting in the queue. When the service of a customer is complete it departs from 
the server and then another customer from the queue, if any, immediately enters the server. The 
choice of which customer to be served next depends on the service discipline. Common service 
disciplines are first-come first-served (FCFS) in which customers are served in the order of their 
arrival, or last-come first-served (LCFS) in which the customer that arrived most recently is served 
next. Some of the more complicated service disciplines involve priority classes, or the notion of 
"processor sharing" in which all customers present in the system receive equal attention from the 
server. 

Often models of queueing systems involve a stochastic description. For example, given positive
parameters $\lambda$ and $\mu$, we may declare that the arrival process is a Poisson process with rate $\lambda$,
and that the service times of the customers are independent and exponentially distributed with
parameter $\mu$. Many queueing systems are given labels of the form A/B/s, where "A" is chosen to
denote the type of arrival process, "B" is used to denote the type of departure process, and s is
the number of servers in the system. In particular, the system just described is called an M/M/1
queueing system, so-named because the arrival process is memoryless (i.e. a Poisson arrival process),
the service times are memoryless (i.e. are exponentially distributed), and there is a single server.
Other labels for queueing systems have a fourth descriptor and thus have the form A/B/s/b, where
b denotes the maximum number of customers that can be in the system. Thus, an M/M/1 system
is also an M/M/1/$\infty$ system, because there is no finite bound on the number of customers in the
system.

A second way to specify an M/M/1 queueing system with parameters $\lambda$ and $\mu$ is to let $A(t)$ and
$D(t)$ be independent Poisson processes with rates $\lambda$ and $\mu$ respectively. Process $A$ marks customer
arrival times and process $D$ marks potential customer departure times. The number of customers in
the system, starting from some initial value $N(0)$, evolves as follows. Each time there is a jump of
$A$, a customer arrives to the system. Each time there is a jump of $D$, there is a potential departure,
meaning that if there is a customer in the server at the time of the jump then the customer departs.
If a potential departure occurs when the system is empty then the potential departure has no effect
on the system. The number of customers in the system $N$ can thus be expressed as

$$N(t) = N(0) + A(t) - \int_0^t I_{\{N(s-) \geq 1\}}\,dD(s).$$

It is easy to verify that the resulting process $N$ is Markov, which leads to the third specification of
an M/M/1 queueing system.

A third way to specify an M/M/1 queueing system is that the number of customers in the system
$N(t)$ is a birth-death process with $\lambda_k = \lambda$ and $\mu_k = \mu$ for all $k$, for some parameters $\lambda$ and $\mu$. Let
$\rho = \lambda/\mu$. Using the classification criteria derived for birth-death processes, it is easy to see that
the system is recurrent if and only if $\rho \leq 1$, and that it is positive recurrent if and only if $\rho < 1$.
Moreover, if $\rho < 1$ the equilibrium distribution for the number of customers in the system is given
by $\pi_k = (1 - \rho)\rho^k$ for $k \geq 0$. This is the geometric distribution with zero as a possible value, and
with mean

$$\bar{N} = \sum_{k=0}^{\infty} k\pi_k = (1 - \rho)\rho\sum_{k=1}^{\infty} \rho^{k-1}k = \frac{\rho}{1 - \rho}.$$

The probability the server is busy, which is also the mean number of customers in the server, is
$1 - \pi_0 = \rho$. The mean number of customers in the queue is thus given by $\rho/(1 - \rho) - \rho = \rho^2/(1 - \rho)$.
This third specification is the most commonly used way to define an M/M/1 queueing process.

Since the M/M/1 process $N(t)$ is positive recurrent, the Markov ergodic convergence theorem
implies that the statistical averages just computed, such as $\bar{N}$, are also equal to the limit of the
time-averaged number of customers in the system as the averaging interval tends to infinity.

An important performance measure for a queueing system is the mean time spent in the system
or the mean time spent in the queue. Little's law, described next, is a quite general and useful
relationship that aids in computing mean transit time.

Little's law can be applied in a great variety of circumstances involving flow through a system
with delay. In the context of queueing systems we speak of a flow of customers, but the same
principle applies to a flow of water through a pipe. Little's law is that $\bar{\lambda}\bar{T} = \bar{N}$ where $\bar{\lambda}$ is the
mean flow rate, $\bar{T}$ is the mean delay in the system, and $\bar{N}$ is the mean content of the system.
For example, if water flows through a pipe with volume one cubic meter at the rate of two cubic
meters per minute, the mean time (averaged over all drops of water) that water spends in the pipe
is $\bar{T} = \bar{N}/\bar{\lambda} = 1/2$ minute. This is clear if water flows through the pipe without mixing, because
the transit time of each drop of water is 1/2 minute. However, mixing within the pipe does not
affect the average transit time.

Little's law is actually a set of results, each with somewhat different mathematical assumptions.
The following version is quite general. Figure 6.8 pictures the cumulative number of arrivals ($\alpha(t)$)
and the cumulative number of departures ($\delta(t)$) versus time, for a queueing system assumed to be
initially empty. Note that the number of customers in the system at any time $s$ is given by the
difference $N(s) = \alpha(s) - \delta(s)$, which is the vertical distance between the arrival and departure
graphs in the figure. On the other hand, assuming that customers are served in first-come first-served
order, the horizontal distance between the graphs gives the times in system for the customers.
Given a (usually large) $t > 0$, let $\gamma_t$ denote the area of the region between the two graphs over
the interval $[0, t]$. This is the shaded region indicated in the figure. It is natural to define the
time-averaged values of arrival rate and system content as

$$\bar{\lambda}_t = \alpha(t)/t \quad \text{and} \quad \bar{N}_t = \frac{1}{t}\int_0^t N(s)\,ds = \gamma_t/t.$$

Finally, the average, over the $\alpha(t)$ customers that arrive during the interval $[0, t]$, of the time spent
in the system up to time $t$, is given by

$$\bar{T}_t = \gamma_t/\alpha(t).$$

Figure 6.8: A single server queueing system.

Once these definitions are accepted, we have the following obvious proposition.

Proposition 6.6.1 (Little's law, expressed using averages over time) For any $t > 0$,

$$\bar{N}_t = \bar{\lambda}_t\bar{T}_t \tag{6.14}$$

Furthermore, if any two of the three variables in (6.14) converge to a positive finite limit as $t \to \infty$,
then so does the third variable, and the limits satisfy $\bar{N}_\infty = \bar{\lambda}_\infty\bar{T}_\infty$.
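The identity (6.14) is easy to confirm on a concrete sample path. The Python sketch below computes the three time averages from lists of arrival and departure times, using the fact that the area between the two graphs equals the total time that customers arriving in $[0, t]$ have spent in the system up to time $t$. The function name and the three-customer data are made up for the illustration.

```python
def little_averages(arrivals, departures, t):
    """Return (lambda_t, N_t, T_t) over [0, t] for a system empty at time 0.
    arrivals[j] and departures[j] are the arrival and departure times of
    customer j. Requires at least one arrival in [0, t]."""
    in_window = [(a, d) for a, d in zip(arrivals, departures) if a <= t]
    alpha_t = len(in_window)                              # arrivals in [0, t]
    gamma_t = sum(min(d, t) - a for a, d in in_window)    # area between graphs
    return alpha_t / t, gamma_t / t, gamma_t / alpha_t

# Hypothetical sample path with three customers, observed up to t = 4.
lam_t, n_t, t_t = little_averages([0.0, 1.0, 2.0], [1.5, 3.0, 3.5], 4.0)
print(lam_t, n_t, t_t, lam_t * t_t)
```

For this data the area is 5, so the averages are 3/4, 5/4 and 5/3, and the product of the first and third reproduces the second.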

For example, the number of customers in an M/M/1 queue is a positive recurrent Markov
process so that

$$\lim_{t\to\infty} \bar{N}_t = \bar{N} = \rho/(1 - \rho)$$

where calculation of the statistical mean $\bar{N}$ was previously discussed. Also, by the law of large
numbers applied to interarrival times, we have that the Poisson arrival process for an M/M/1
queue satisfies $\lim_{t\to\infty} \bar{\lambda}_t = \lambda$ with probability one. Thus, with probability one,

$$\lim_{t\to\infty} \bar{T}_t = \bar{N}/\lambda = \frac{1}{\mu - \lambda}.$$

In this sense, the average waiting time in an M/M/1 system is $1/(\mu - \lambda)$. The average time in service
is $1/\mu$ (this follows from the third description of an M/M/1 queue, or also from Little's law applied
to the server alone) so that the average waiting time in queue is given by $W = 1/(\mu - \lambda) - 1/\mu = \rho/(\mu - \lambda)$.
This final result also follows from Little's law applied to the queue alone.
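The M/M/1 quantities derived in this section fit in a few lines of code. The Python sketch below collects them and makes it easy to confirm the two applications of Little's law, to the whole system and to the queue alone. The function name, the dictionary keys, and the rates in the usage line are illustrative assumptions.

```python
def mm1_metrics(lam, mu):
    """Equilibrium performance measures of an M/M/1 queue with arrival
    rate lam and service rate mu; requires 0 < lam < mu."""
    if not 0 < lam < mu:
        raise ValueError("positive recurrence requires 0 < lam < mu")
    rho = lam / mu
    return {
        "rho": rho,                          # probability the server is busy
        "N": rho / (1 - rho),                # mean number in system
        "N_queue": rho ** 2 / (1 - rho),     # mean number waiting in queue
        "T": 1 / (mu - lam),                 # mean time in system
        "W": rho / (mu - lam),               # mean time waiting in queue
    }

m = mm1_metrics(1.0, 2.0)                    # hypothetical rates
print(m)
print(m["N"] - 1.0 * m["T"], m["N_queue"] - 1.0 * m["W"])   # both differences are zero
```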

6.7 Mean arrival rate, distributions seen by arrivals, and PASTA 

The mean arrival rate for the M/M/1 system is $\lambda$, the parameter of the Poisson arrival process.
However for some queueing systems the arrival rate depends on the number of customers in the
system. In such cases the mean arrival rate is still typically meaningful, and it can be used in
Little's law.

Suppose the number of customers in a queueing system is modeled by a birth-death process
with arrival rates $(\lambda_k)$ and departure rates $(\mu_k)$. Suppose in addition that the process is positive
recurrent. Intuitively, the process spends a fraction of time $\pi_k$ in state $k$ and while in state $k$ the
arrival rate is $\lambda_k$. Therefore, the average arrival rate is

$$\bar{\lambda} = \sum_{k=0}^{\infty} \pi_k\lambda_k.$$

Similarly the average departure rate is

$$\bar{\mu} = \sum_{k=1}^{\infty} \pi_k\mu_k,$$

and of course $\bar{\lambda} = \bar{\mu}$ because both are equal to the throughput of the system.

Often the distribution of a system at particular system-related sampling times is more important
than the distribution in equilibrium. For example, the distribution seen by arriving customers
may be the most relevant distribution, as far as the customers are concerned. If the arrival rate
depends on the number of customers in the system then the distribution seen by arrivals need not
be the same as the equilibrium distribution. Intuitively, $\pi_k\lambda_k$ is the long-term frequency of arrivals
which occur when there are $k$ customers in the system, so that the fraction of customers that see
$k$ customers in the system upon arrival is given by

$$r_k = \frac{\pi_k\lambda_k}{\bar{\lambda}}.$$

The following is an example of a system with variable arrival rate.

Example 6.7.1 (Single-server, discouraged arrivals) Suppose $\lambda_k = \alpha/(k+1)$ and $\mu_k = \mu$ for all
$k$, where $\mu$ and $\alpha$ are positive constants. Then

$$S_2 = \sum_{k=0}^{\infty} \frac{(k+1)!\,\mu^k}{\alpha^k} = \infty \quad \text{and} \quad S_1 = \sum_{k=0}^{\infty} \frac{\alpha^k}{k!\,\mu^k} = \exp\left(\frac{\alpha}{\mu}\right) < \infty$$

so that the number of customers in the system is a positive recurrent Markov process, with no
additional restrictions on $\alpha$ and $\mu$. Moreover, the equilibrium probability distribution is given by
$\pi_k = (\alpha/\mu)^k\exp(-\alpha/\mu)/k!$, which is the Poisson distribution with mean $\bar{N} = \alpha/\mu$. The mean
arrival rate is

$$\bar{\lambda} = \sum_{k=0}^{\infty} \frac{\pi_k\alpha}{k+1} = \mu\exp(-\alpha/\mu)\sum_{k=0}^{\infty} \frac{(\alpha/\mu)^{k+1}}{(k+1)!} = \mu\exp(-\alpha/\mu)(\exp(\alpha/\mu) - 1) = \mu(1 - \exp(-\alpha/\mu)). \tag{6.15}$$

This expression derived for $\bar{\lambda}$ is clearly equal to $\bar{\mu}$, because the departure rate is $\mu$ with probability
$1 - \pi_0$ and zero otherwise. The distribution of the number of customers in the system seen by
arrivals, $(r_k)$, is given by

$$r_k = \frac{\pi_k\alpha}{(k+1)\bar{\lambda}} = \frac{(\alpha/\mu)^{k+1}\exp(-\alpha/\mu)}{(k+1)!\,(1 - \exp(-\alpha/\mu))} \quad \text{for } k \geq 0,$$

which in words can be described as the result of removing the probability mass at zero in the
Poisson distribution, shifting the distribution down by one, and then renormalizing. The mean
number of customers in the system seen by a typical arrival is therefore $(\alpha/\mu)/(1 - \exp(-\alpha/\mu)) - 1$.
This mean is somewhat less than $\bar{N}$ because, roughly speaking, the customer arrival rate is higher
when the system is more lightly loaded.
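The quantities in Example 6.7.1 can be evaluated numerically from their definitions, which gives an independent check on the closed forms. The Python sketch below truncates the Poisson distribution at a fixed number of terms; the function name, the truncation level `n_terms`, and the parameter values are assumptions made for the illustration.

```python
import math

def discouraged_arrivals(alpha, mu, n_terms=60):
    """Return (pi, lam_bar, r, mean_seen) for lambda_k = alpha/(k+1), mu_k = mu,
    with the Poisson equilibrium distribution truncated to n_terms terms."""
    a = alpha / mu
    pi = [math.exp(-a) * a ** k / math.factorial(k) for k in range(n_terms)]
    # mean arrival rate: sum over k of pi_k * lambda_k
    lam_bar = sum(p * alpha / (k + 1) for k, p in enumerate(pi))
    # distribution seen by arrivals: r_k = pi_k * lambda_k / lam_bar
    r = [p * alpha / (k + 1) / lam_bar for k, p in enumerate(pi)]
    mean_seen = sum(k * rk for k, rk in enumerate(r))
    return pi, lam_bar, r, mean_seen

pi, lam_bar, r, mean_seen = discouraged_arrivals(2.0, 1.0)   # hypothetical values
print(lam_bar, 1.0 * (1 - math.exp(-2.0)))   # numerical sum and closed form (6.15)
print(mean_seen, sum(k * p for k, p in enumerate(pi)))       # mean seen vs. time-average mean
```

The truncation is harmless when $\alpha/\mu$ is moderate, because the omitted Poisson tail is negligible; for large $\alpha/\mu$ the number of terms would need to grow.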

The equivalence of time-averages and statistical averages for computing the mean arrival rate
and the distribution seen by arrivals can be shown by application of ergodic properties of the
processes involved. The associated formal approach is described next, in slightly more generality.
Let $X$ denote an irreducible, positive-recurrent pure-jump Markov process. If the process makes a
jump from state $i$ to state $j$ at time $t$, say that a transition of type $(i,j)$ occurs. The sequence of
transitions of $X$ forms a new Markov process, $Y$. The process $Y$ is a discrete-time Markov process
with state space $\{(i,j) \in S \times S : q_{ij} > 0\}$, and it can be described in terms of the jump process for
$X$, by $Y(k) = (X^J(k-1), X^J(k))$ for $k \geq 0$. (Let $X^J(-1)$ be defined arbitrarily.)

The one-step transition probability matrix of the jump process $X^J$ is given by $\pi^J_{ij} = q_{ij}/(-q_{ii})$,
and $X^J$ is recurrent because $X$ is recurrent. Its equilibrium distribution $\pi^J$ (if it exists) is proportional
to $-\pi_i q_{ii}$ (see Problem 6.3), and $X^J$ is positive recurrent if and only if this distribution can
be normalized to make a probability distribution, i.e. if and only if $R = -\sum_i \pi_i q_{ii} < \infty$. Assume
for simplicity that $X^J$ is positive recurrent. Then $\pi^J_i = -\pi_i q_{ii}/R$ is the equilibrium probability
distribution of $X^J$. Furthermore, $Y$ is positive recurrent and its equilibrium distribution is given
by

$$\pi^Y_{ij} = \pi^J_i\pi^J_{ij} = \frac{-\pi_i q_{ii}}{R} \cdot \frac{q_{ij}}{-q_{ii}} = \frac{\pi_i q_{ij}}{R}.$$

Since limiting time averages equal statistical averages for $Y$,

$$\lim_{n\to\infty} (\text{number of first } n \text{ transitions of } X \text{ that are type } (i,j))/n = \pi_i q_{ij}/R$$

with probability one. Therefore, if $A \subset S \times S$, and if $(i,j) \in A$,

$$\lim_{n\to\infty} \frac{\text{number of first } n \text{ transitions of } X \text{ that are type } (i,j)}{\text{number of first } n \text{ transitions of } X \text{ with type in } A} = \frac{\pi_i q_{ij}}{\sum_{(i',j') \in A} \pi_{i'} q_{i'j'}} \tag{6.16}$$

To apply this setup to the special case of a queueing system in which the number of customers
in the system is a Markov birth-death process, let the set $A$ be the set of transitions of the form
$(i, i+1)$. Then deduce that the fraction of the first $n$ arrivals that see $i$ customers in the system
upon arrival converges to $\pi_i\lambda_i/\sum_j \pi_j\lambda_j$ with probability one.

Note that if $\lambda_i = \lambda$ for all $i$, then $\bar{\lambda} = \lambda$ and $\pi = r$. The condition $\lambda_i = \lambda$ also implies that the
arrival process is Poisson. This situation is called "Poisson Arrivals See Time Averages" (PASTA).



6.8 More examples of queueing systems modeled as Markov birth-death processes

For each of the four examples of this section it is assumed that new customers are offered to the
system according to a Poisson process with rate $\lambda$, so that the PASTA property holds. Also, when
there are $k$ customers in the system then the service rate is $\mu_k$ for some given numbers $\mu_k$. The
number of customers in the system is a Markov birth-death process with $\lambda_k = \lambda$ for all $k$. Since
the number of transitions of the process up to any given time $t$ is at most twice the number of
customers that arrived by time $t$, the Markov process is not explosive. Therefore the process is
positive recurrent if and only if $S_1$ is finite, where

$$S_1 = \sum_{k=0}^{\infty} \frac{\lambda^k}{\mu_1\mu_2 \cdots \mu_k}.$$

Special cases of this example are presented in the next four examples.

Example 6.8.1 (M/M/m systems) An M/M/m queueing system consists of a single queue and
$m$ servers. The arrival process is Poisson with some rate $\lambda$ and the customer service times are
independent and exponentially distributed with parameter $\mu$ (that is, with mean $1/\mu$) for some $\mu > 0$. The total number of
customers in the system is a birth-death process with $\mu_k = \mu\min(k,m)$. Let $\rho = \lambda/(m\mu)$. Since
$\mu_k = m\mu$ for all $k$ large enough it is easy to check that the process is positive recurrent if and only
if $\rho < 1$. Assume now that $\rho < 1$. Then the equilibrium distribution is given by

$$\pi_k = \frac{(\lambda/\mu)^k}{S_1 k!} \quad \text{for } 0 \leq k \leq m$$

$$\pi_{m+j} = \pi_m\rho^j \quad \text{for } j \geq 1$$

where $S_1$ is chosen to make the probabilities sum to one (use the fact $1 + \rho + \rho^2 + \cdots = 1/(1 - \rho)$):

$$S_1 = \left(\sum_{k=0}^{m-1} \frac{(\lambda/\mu)^k}{k!}\right) + \frac{(\lambda/\mu)^m}{m!(1 - \rho)}.$$

An arriving customer must join the queue (rather than go directly to a server) if and only if the
system has $m$ or more customers in it. By the PASTA property, this is the same as the equilibrium
probability of having $m$ or more customers in the system:

$$P_Q = \sum_{j=0}^{\infty} \pi_{m+j} = \pi_m/(1 - \rho)$$

This formula is called the Erlang C formula for probability of queueing.
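The Erlang C formula is straightforward to evaluate from the expressions for $S_1$ and $\pi_m$. The Python sketch below does so directly; the function name and argument order are hypothetical, and the direct use of factorials is only suitable for moderate $m$.

```python
import math

def erlang_c(m, lam, mu):
    """Probability of queueing P_Q = pi_m / (1 - rho) for an M/M/m system
    with arrival rate lam and per-server service rate mu."""
    a = lam / mu                         # offered load lambda/mu
    rho = a / m
    if rho >= 1:
        raise ValueError("positive recurrence requires lam < m * mu")
    tail = a ** m / (math.factorial(m) * (1 - rho))
    s1 = sum(a ** k / math.factorial(k) for k in range(m)) + tail
    return tail / s1                     # equals pi_m / (1 - rho)

print(erlang_c(1, 1.0, 2.0))             # with one server this is rho = 0.5
print(erlang_c(2, 1.0, 1.0))             # two servers, offered load 1
```

With a single server the result reduces to $\rho$, the probability that the M/M/1 server is busy.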



Example 6.8.2 (M/M/m/m systems) An M/M/m/m queueing system consists of $m$ servers. The
arrival process is Poisson with some rate $\lambda$ and the customer service times are independent and
exponentially distributed with parameter $\mu$ (that is, with mean $1/\mu$) for some $\mu > 0$. Since there is no queue, if a customer
arrives when there are already $m$ customers in the system, the arrival is blocked and cleared from
the system. The total number of customers in the system is a birth-death process, but with the
state space reduced to $\{0, 1, \ldots, m\}$, and with $\mu_k = k\mu$ for $1 \leq k \leq m$. The unique equilibrium
distribution is given by

$$\pi_k = \frac{(\lambda/\mu)^k}{S_1 k!} \quad \text{for } 0 \leq k \leq m$$

where $S_1$ is chosen to make the probabilities sum to one.

An arriving customer is blocked and cleared from the system if and only if the system already
has $m$ customers in it. By the PASTA property, this is the same as the equilibrium probability of
having $m$ customers in the system:

$$P_B = \pi_m = \frac{(\lambda/\mu)^m/m!}{\sum_{j=0}^{m} (\lambda/\mu)^j/j!}$$

This formula is called the Erlang B formula for probability of blocking.
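The Erlang B formula can likewise be evaluated term by term. The Python sketch below is a direct transcription of the ratio above; the function name and argument order are hypothetical, and for large $m$ a numerically safer evaluation scheme would be preferable to raw factorials.

```python
import math

def erlang_b(m, lam, mu):
    """Blocking probability P_B = pi_m for an M/M/m/m system with arrival
    rate lam and per-server service rate mu."""
    a = lam / mu                                   # offered load lambda/mu
    terms = [a ** j / math.factorial(j) for j in range(m + 1)]
    return terms[m] / sum(terms)

print(erlang_b(1, 1.0, 1.0))                       # 0.5
print(erlang_b(2, 1.0, 1.0))                       # 0.2
```

Unlike Erlang C, no stability condition is needed here, since the state space is finite.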



Example 6.8.3 (A system with a discouraged server) The number of customers in this system is a
birth-death process with constant birth rate $\lambda$ and death rates $\mu_k = 1/k$. It is easy to check that
all states are transient for any positive value of $\lambda$ (to verify this it suffices to check that $S_2 < \infty$).
It is not difficult to show that $N(t)$ converges to $+\infty$ with probability one as $t \to \infty$.



Example 6.8.4 (A barely stable system) The number of customers in this system is a birth-death
process with constant birth rate $\lambda$ and death rates $\mu_k = \frac{\lambda(1 + k^2)}{1 + (k-1)^2}$ for all $k \geq 1$. Since the departure
rates are barely larger than the arrival rates, this system is near the borderline between recurrence
and transience. However, we see that

$$S_1 = \sum_{k=0}^{\infty} \frac{1}{1 + k^2} < \infty$$

so that $N(t)$ is positive recurrent with equilibrium distribution $\pi_k = 1/(S_1(1 + k^2))$. Note that the
mean number of customers in the system is

$$\bar{N} = \sum_{k=0}^{\infty} k/(S_1(1 + k^2)) = \infty$$

By Little's law the mean time customers spend in the system is also infinity. It is debatable whether
this system should be thought of as "stable" even though all states are positive recurrent and all
waiting times are finite with probability one.



6.9 Foster-Lyapunov stability criterion and moment bounds 

Communication network models can become quite complex, especially when dynamic scheduling, 
congestion, and physical layer effects such as fading wireless channel models are included. It is thus 
useful to have methods to give approximations or bounds on key performance parameters. The 
criteria for stability and related moment bounds discussed in this chapter are useful for providing 
such bounds. 

Aleksandr Mikhailovich Lyapunov (1857-1918) contributed significantly to the theory of stabil- 
ity of dynamical systems. Although a dynamical system may evolve on a complicated, multiple 
dimensional state space, a recurring theme of dynamical systems theory is that stability questions 
can often be settled by studying the potential of a system for some nonnegative potential function 
V. Potential functions used for stability analysis are widely called Lyapunov functions. Similar 
stability conditions have been developed by many authors for stochastic systems. Below we present 
the well known criteria due to Foster [4] for recurrence and positive recurrence. In addition we 
present associated bounds on the moments, which are expectations of some functions on the state 
space, computed with respect to the equilibrium probability distribution.³

Subsection 6.9.1 discusses the discrete-time tools, and presents examples involving load balanc- 
ing routing, and input queued crossbar switches. Subsection 6.9.2 presents the continuous time 
tools, and an example. 

6.9.1 Stability criteria for discrete-time processes 

Consider an irreducible discrete-time Markov process $X$ on a countable state space $\mathcal{S}$, with one-step
transition probability matrix $P$. If $f$ is a function on $\mathcal{S}$, $Pf$ represents the function obtained by
multiplication of the vector $f$ by the matrix $P$: $Pf(i) = \sum_{j \in \mathcal{S}} p_{ij} f(j)$. If $f$ is nonnegative, $Pf$
is well defined, with the understanding that $Pf(i) = +\infty$ is possible for some, or all, values of $i$.
An important property of $Pf$ is that $Pf(i) = E[f(X(t+1)) \mid X(t) = i]$. Let $V$ be a nonnegative
function on $\mathcal{S}$, to serve as the Lyapunov function. The drift vector of $V(X(t))$ is defined by



3 A version of these moment bounds was given by Tweedie [15], and a version of the moment bound method was
used by Kingman [5] in a queueing context. As noted in [9], the moment bound method is closely related to Dynkin's
formula. The works [13, 14, 6, 12], and many others, have demonstrated the wide applicability of the stability methods
in various queueing network contexts, using quadratic Lyapunov functions.
$d(i) = E[V(X(t+1)) \mid X(t) = i] - V(i)$. That is, $d = PV - V$. Note that $d(i)$ is always well-defined,
if the value $+\infty$ is permitted. The drift vector is also given by

$$d(i) = \sum_{j : j \neq i} p_{ij}\,(V(j) - V(i)). \qquad (6.17)$$
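
As a concrete illustration of (6.17), the sketch below computes the drift vector for a small chain. The chain itself (a single discrete-time queue with Bernoulli arrivals and departures, truncated at a maximum length), the function names, and the parameter values are all assumptions chosen for illustration.

```python
def queue_transition_matrix(a, d, max_len):
    """One-step matrix of a single queue truncated at max_len (illustrative model).

    Each slot: an arrival with probability a, then a potential departure with
    probability d. Arrivals beyond max_len are dropped (truncation assumption).
    """
    n = max_len + 1
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for arr, p_arr in ((1, a), (0, 1 - a)):
            y = min(x + arr, max_len)
            for dep, p_dep in ((1, d), (0, 1 - d)):
                z = max(y - dep, 0)          # an empty queue has no actual departure
                P[x][z] += p_arr * p_dep
    return P

def drift(P, V):
    """Drift vector d(i) = sum_j p_ij * (V(j) - V(i)), as in (6.17)."""
    n = len(P)
    return [sum(P[i][j] * (V[j] - V[i]) for j in range(n)) for i in range(n)]

a, d, max_len = 0.3, 0.5, 20
P = queue_transition_matrix(a, d, max_len)
V = [x * x / 2 for x in range(max_len + 1)]      # quadratic Lyapunov function
dvec = drift(P, V)
print([round(v, 3) for v in dvec[:6]])
```

For these values the drift is negative at every state $x \ge 2$, so the set of states with nonnegative drift is finite.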

Proposition 6.9.1 (Foster-Lyapunov stability criterion) Suppose $V : \mathcal{S} \to \mathbb{R}_+$ and $C$ is a finite
subset of $\mathcal{S}$.

(a) If $\{i : V(i) \le K\}$ is finite for all $K$, and if $PV - V \le 0$ on $\mathcal{S} - C$, then $X$ is recurrent.

(b) If $\epsilon > 0$ and $b$ is a constant such that $PV - V \le -\epsilon + b I_C$, then $X$ is positive recurrent.

Proposition 6.9.2 (Moment bound) Suppose $V$, $f$, and $g$ are nonnegative functions on $\mathcal{S}$ and
suppose

$$PV(i) - V(i) \le -f(i) + g(i) \quad \text{for all } i \in \mathcal{S} \qquad (6.18)$$

In addition, suppose $X$ is positive recurrent, so that the means, $\bar f = \pi f$ and $\bar g = \pi g$, are well-defined.
Then $\bar f \le \bar g$. (In particular, if $g$ is bounded, then $\bar g$ is finite, and therefore $\bar f$ is finite.)

Corollary 6.9.3 (Combined Foster-Lyapunov stability criterion and moment bound) Suppose $V$, $f$,
and $g$ are nonnegative functions on $\mathcal{S}$ such that

$$PV(i) - V(i) \le -f(i) + g(i) \quad \text{for all } i \in \mathcal{S} \qquad (6.19)$$

In addition, suppose for some $\epsilon > 0$ that the set $C$ defined by $C = \{i : f(i) < g(i) + \epsilon\}$ is finite. Then
$X$ is positive recurrent and $\bar f \le \bar g$. (In particular, if $g$ is bounded, then $\bar g$ is finite, and therefore $\bar f$
is finite.)

Proof. Let $b = \max\{g(i) + \epsilon - f(i) : i \in C\}$. Then $V$, $C$, $b$, and $\epsilon$ satisfy the hypotheses of
Proposition 6.9.1(b), so that $X$ is positive recurrent. Therefore the hypotheses of Proposition 6.9.2
are satisfied, so that $\bar f \le \bar g$. $\blacksquare$

The assumptions in Propositions 6.9.1 and 6.9.2 and Corollary 6.9.3 do not imply that $\bar V$ is
finite. Even so, since $V$ is nonnegative, for a given initial state $X(0)$, the long term average drift of
$V(X(t))$ is nonnegative. This gives an intuitive reason why the mean downward part of the drift,
$\bar f$, must be less than or equal to the mean upward part of the drift, $\bar g$.
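
The moment bound can be checked numerically on a small chain. The sketch below is an illustration only: the reflected random walk, its parameters, and the helper names are assumptions, and the equilibrium distribution is approximated by repeated multiplication by $P$. With $V(i) = i^2/2$, the drift satisfies (6.18) with $f(i) = (\text{down} - \text{up})\,i$ and $g(i) = (\text{up} + \text{down})/2$, so the proposition predicts a mean state of at most $g/(\text{down} - \text{up})$.

```python
def birth_death_matrix(up, down, n_states):
    """Transition matrix of a reflected random walk on {0, ..., n_states - 1}."""
    P = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i + 1 < n_states:
            P[i][i + 1] = up
        if i > 0:
            P[i][i - 1] = down
        P[i][i] = 1.0 - sum(P[i])        # remaining probability: stay put
    return P

def stationary(P, iters=3000):
    """Approximate the equilibrium distribution by iterating pi <- pi P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

up, down, n_states = 0.2, 0.4, 25
P = birth_death_matrix(up, down, n_states)
V = [i * i / 2 for i in range(n_states)]
drift = [sum(P[i][j] * (V[j] - V[i]) for j in range(n_states)) for i in range(n_states)]

f = [(down - up) * i for i in range(n_states)]   # downward part of the drift bound
g = [(up + down) / 2] * n_states                 # upward part of the drift bound
drift_ok = all(drift[i] <= -f[i] + g[i] + 1e-12 for i in range(n_states))

pi = stationary(P)
mean_f = sum(p * v for p, v in zip(pi, f))
mean_g = sum(p * v for p, v in zip(pi, g))
mean_x = sum(p * i for i, p in enumerate(pi))
print(drift_ok, round(mean_f, 4), round(mean_g, 4), round(mean_x, 4))
```

Here the bound on the mean state is $0.3/0.2 = 1.5$, while the computed mean is close to one, so the bound holds but is not tight.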



Example 6.9.4 (Probabilistic routing to two queues) Consider the routing scenario with two
queues, queue 1 and queue 2, fed by a single stream of packets, as pictured in Figure 6.9. Here,
$0 \le a, u, d_1, d_2 \le 1$, and $\bar u = 1 - u$. The state space for the process is $\mathcal{S} = \mathbb{Z}_+^2$, where the state
$x = (x_1, x_2)$ denotes $x_1$ packets in queue 1 and $x_2$ packets in queue 2. In each time slot, a new
arrival is generated with probability $a$, and then is routed to queue 1 with probability $u$ and to
queue 2 with probability $\bar u$. Then each queue $i$, if not empty, has a departure with probability $d_i$.
Figure 6.9: Two queues fed by a single arrival stream.

Note that we allow a packet to arrive and depart in the same slot. Thus, if $X_i(t)$ is the number of
packets in queue $i$ at the beginning of slot $t$, then the system dynamics can be described as follows:

$$X_i(t+1) = X_i(t) + A_i(t) - D_i(t) + L_i(t) \quad \text{for } i \in \{1, 2\} \qquad (6.20)$$

where

• $A(t) = (A_1(t), A_2(t))$ is equal to $(1,0)$ with probability $au$, $(0,1)$ with probability $a\bar u$, and
$A(t) = (0,0)$ otherwise.

• $D_i(t) : t \ge 0$, are Bernoulli($d_i$) random variables, for $i \in \{1, 2\}$

• All the $A(t)$'s, $D_1(t)$'s, and $D_2(t)$'s are mutually independent

• $L_i(t) = (-(X_i(t) + A_i(t) - D_i(t)))_+$ (see explanation next)

If $X_i(t) + A_i(t) = 0$, there can be no actual departure from queue $i$. However, we still allow $D_i(t)$
to equal one. To keep the queue length process from going negative, we add the random variable
$L_i(t)$ in (6.20). Thus, $D_i(t)$ is the potential number of departures from queue $i$ during the slot, and
$D_i(t) - L_i(t)$ is the actual number of departures. This completes the specification of the one-step
transition probabilities of the Markov process.

A necessary condition for positive recurrence is, for any routing policy, $a < d_1 + d_2$, because the
total arrival rate must be less than the total departure rate. We seek to show that this necessary
condition is also sufficient, under the random routing policy.

Let us calculate the drift of $V(X(t))$ for the choice $V(x) = (x_1^2 + x_2^2)/2$. Note that $(X_i(t+1))^2 =
(X_i(t) + A_i(t) - D_i(t) + L_i(t))^2 \le (X_i(t) + A_i(t) - D_i(t))^2$, because addition of the variable $L_i(t)$
can only push $X_i(t) + A_i(t) - D_i(t)$ closer to zero. Thus,

$$\begin{aligned}
PV(x) - V(x) &= E[V(X(t+1)) \mid X(t) = x] - V(x) \\
&\le \frac{1}{2} \sum_{i=1}^{2} E[(x_i + A_i(t) - D_i(t))^2 - x_i^2 \mid X(t) = x] \\
&= \sum_{i=1}^{2} \Big( x_i E[A_i(t) - D_i(t) \mid X(t) = x] + \frac{1}{2} E[(A_i(t) - D_i(t))^2 \mid X(t) = x] \Big) \qquad (6.21) \\
&\le \Big( \sum_{i=1}^{2} x_i E[A_i(t) - D_i(t) \mid X(t) = x] \Big) + 1 \\
&= -\big( x_1 (d_1 - au) + x_2 (d_2 - a\bar u) \big) + 1 \qquad (6.22)
\end{aligned}$$

Under the necessary condition $a < d_1 + d_2$, there are choices of $u$ so that $au < d_1$ and $a\bar u < d_2$, and
for such $u$ the conditions of Corollary 6.9.3 are satisfied, with $f(x) = x_1(d_1 - au) + x_2(d_2 - a\bar u)$,
$g(x) = 1$, and any $\epsilon > 0$, implying that the Markov process is positive recurrent. In addition, the
first moments under the equilibrium distribution satisfy:

$$(d_1 - au)\bar X_1 + (d_2 - a\bar u)\bar X_2 \le 1. \qquad (6.23)$$

In order to deduce an upper bound on $\bar X_1 + \bar X_2$, we select $u^*$ to maximize the minimum of the
two coefficients in (6.23). Intuitively, this entails selecting $u$ to minimize the absolute value of the
difference between the two coefficients. We find:

$$\epsilon = \max_{0 \le u \le 1} \min\{d_1 - au,\ d_2 - a\bar u\} = \min\Big\{d_1,\ d_2,\ \frac{d_1 + d_2 - a}{2}\Big\}$$

and the corresponding value $u^*$ of $u$ is given by

$$u^* = \begin{cases}
0 & \text{if } d_1 - d_2 < -a \\
\frac{1}{2} + \frac{d_1 - d_2}{2a} & \text{if } |d_1 - d_2| \le a \\
1 & \text{if } d_1 - d_2 > a
\end{cases}$$
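
The closed form for $u^*$ and $\epsilon$ can be checked against a direct search. The following sketch assumes $a > 0$; the function names and the sample parameter values are illustrative choices, and the grid search is only a sanity check.

```python
def optimal_split(a, d1, d2):
    """Return (u_star, eps) maximizing min(d1 - a*u, d2 - a*(1-u)) over u in [0, 1]."""
    if d1 - d2 < -a:
        u_star = 0.0
    elif d1 - d2 > a:
        u_star = 1.0
    else:
        u_star = 0.5 + (d1 - d2) / (2 * a)       # equalizes the two coefficients
    eps = min(d1 - a * u_star, d2 - a * (1 - u_star))
    return u_star, eps

def brute_force_eps(a, d1, d2, grid=10001):
    """Approximate the same maximum by searching a uniform grid of u values."""
    return max(min(d1 - a * u, d2 - a * (1 - u))
               for u in (k / (grid - 1) for k in range(grid)))

for params in [(0.5, 0.4, 0.3), (0.2, 0.3, 0.9), (0.2, 0.9, 0.3)]:
    u_star, eps = optimal_split(*params)
    print(params, round(u_star, 4), round(eps, 4), round(brute_force_eps(*params), 4))
```

In each case the value of $\epsilon$ agrees with $\min\{d_1, d_2, (d_1 + d_2 - a)/2\}$.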



For the system with $u = u^*$, (6.23) yields

$$\bar X_1 + \bar X_2 \le \frac{1}{\epsilon}. \qquad (6.24)$$

We remark that, in fact,

$$\bar X_1 + \bar X_2 \le \frac{2}{d_1 + d_2 - a}. \qquad (6.25)$$

If $|d_1 - d_2| \le a$ then the bounds (6.24) and (6.25) coincide, and otherwise, the bound (6.25) is
strictly tighter. If $d_1 - d_2 < -a$ then $u^* = 0$, so that $\bar X_1 = 0$, and (6.23) becomes $(d_2 - a)\bar X_2 \le 1$,
which implies (6.25). Similarly, if $d_1 - d_2 > a$, then $u^* = 1$, so that $\bar X_2 = 0$, and (6.23) becomes
$(d_1 - a)\bar X_1 \le 1$, which implies (6.25). Thus, (6.25) is proved.
Example 6.9.5 (Route-to-shorter policy) Consider a variation of the previous example such that
when a packet arrives, it is routed to the shorter queue. To be definite, in case of a tie, the packet
is routed to queue 1. Then the evolution equation (6.20) still holds, but with the description
of the arrival variables changed to the following:

• Given $X(t) = (x_1, x_2)$, $A(t) = (I_{\{x_1 \le x_2\}}, I_{\{x_1 > x_2\}})$ with probability $a$, and $A(t) = (0,0)$
otherwise.

Let $P^{RS}$ denote the one-step transition probability matrix when the route-to-shorter policy is used.
Proceeding as in (6.21) yields:

$$\begin{aligned}
P^{RS}V(x) - V(x) &\le \sum_{i=1}^{2} x_i E[A_i(t) - D_i(t) \mid X(t) = x] + 1 \\
&= a\big( x_1 I_{\{x_1 \le x_2\}} + x_2 I_{\{x_1 > x_2\}} \big) - d_1 x_1 - d_2 x_2 + 1
\end{aligned}$$



Note that $x_1 I_{\{x_1 \le x_2\}} + x_2 I_{\{x_1 > x_2\}} \le u x_1 + \bar u x_2$ for any $u \in [0,1]$, with equality for $u = I_{\{x_1 \le x_2\}}$.
Therefore, the drift bound for $V$ under the route-to-shorter policy is less than or equal to the drift
bound (6.22), for $V$ for any choice of probabilistic splitting. In fact, route-to-shorter routing can
be viewed as a controlled version of the independent splitting model, for which the control policy is
selected to minimize the bound on the drift of $V$ in each state. It follows that the route-to-shorter
process is positive recurrent as long as $a < d_1 + d_2$, and (6.23) holds for any value of $u$ such that
$au < d_1$ and $a\bar u < d_2$. In particular, (6.24) holds for the route-to-shorter process.
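
A short simulation gives a feel for how the two routing policies compare with the bound (6.24). This is a sketch under stated assumptions: the parameter values, the seed, the run length, and the function name are illustrative choices, and a finite time average is only an estimate of the equilibrium mean.

```python
import random

def simulate(a, d1, d2, policy, u=0.5, slots=200_000, seed=1):
    """Time-average of X1 + X2 at slot beginnings under the dynamics (6.20)."""
    rng = random.Random(seed)
    x1 = x2 = 0
    total = 0
    for _ in range(slots):
        total += x1 + x2
        a1 = a2 = 0
        if rng.random() < a:                       # one potential arrival per slot
            if policy == "shorter":
                to_queue_1 = x1 <= x2              # ties are routed to queue 1
            else:
                to_queue_1 = rng.random() < u      # probabilistic splitting
            if to_queue_1:
                a1 = 1
            else:
                a2 = 1
        dep1 = 1 if rng.random() < d1 else 0       # potential departures
        dep2 = 1 if rng.random() < d2 else 0
        x1 = max(x1 + a1 - dep1, 0)                # the max plays the role of L_i(t)
        x2 = max(x2 + a2 - dep2, 0)
    return total / slots

a, d1, d2 = 0.6, 0.4, 0.4
eps = min(d1, d2, (d1 + d2 - a) / 2)
bound = 1 / eps                                    # right hand side of (6.24)
mean_rand = simulate(a, d1, d2, "random")
mean_rs = simulate(a, d1, d2, "shorter")
print(f"bound={bound:.2f}  random split={mean_rand:.2f}  route-to-shorter={mean_rs:.2f}")
```

Both estimates fall below the bound, which is consistent with (6.24) holding for either policy.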

We remark that the stronger bound (6.25) is not always true for the route-to-shorter policy.
The problem is that even if $d_1 - d_2 < -a$, the route-to-shorter policy can still route to queue 1,
and so $\bar X_1 \ne 0$. In fact, if $a$ and $d_2$ are fixed with $0 < a < d_2 < 1$, then $\bar X_1 \to \infty$ as $d_1 \to 0$ for
the route-to-shorter policy. Intuitively, that is because occasionally there will be a large number of
customers in the system due to statistical fluctuations, and then there will be many customers in
queue 1. But if $d_1 \ll 1$, those customers will remain in queue 1 for a very long time.



Example 6.9.6 (An input queued switch with probabilistic switching) 4 Consider a packet switch
with $N$ inputs and $N$ outputs, as pictured in Figure 6.10. Suppose there are $N^2$ queues ($N$ at
each input), with queue $i,j$ containing packets that arrived at input $i$ and are destined for output
$j$, for $i,j \in E$, where $E = \{1, \ldots, N\}$. Suppose the packets are all the same length, and adopt
a discrete-time model, so that during one time slot, a transfer of packets can occur, such that at
most one packet can be transferred from each input, and at most one packet can be transferred to
each output. A permutation $\sigma$ of $E$ has the form $\sigma = (\sigma_1, \ldots, \sigma_N)$, where $\sigma_1, \ldots, \sigma_N$ are distinct



4 Tassiulas [12] originally developed the results of Examples 6.9.6 and 6.9.7, in the context of wireless networks.
The paper [8] presents similar results in the context of a packet switch.
Figure 6.10: A 4 x 4 input queued switch.



elements of $E$. Let $\Pi$ denote the set of all $N!$ such permutations. Given $\sigma \in \Pi$, let $R(\sigma)$ be the
$N \times N$ switching matrix defined by $R_{ij} = I_{\{\sigma_i = j\}}$. Thus, $R_{ij}(\sigma) = 1$ means that under permutation
$\sigma$, input $i$ is connected to output $j$, or, equivalently, a packet in queue $i,j$ is to depart, if there is
any such packet. A state $x$ of the system has the form $x = (x_{ij} : i,j \in E)$, where $x_{ij}$ denotes the
number of packets in queue $i,j$.

The evolution of the system over a time slot $[t, t+1)$ is described as follows:

$$X_{ij}(t+1) = X_{ij}(t) + A_{ij}(t) - R_{ij}(\sigma(t)) + L_{ij}(t)$$

where

• $A_{ij}(t)$ is the number of packets arriving at input $i$, destined for output $j$, in the slot. Assume
that the variables $(A_{ij}(t) : i,j \in E, t \ge 0)$ are mutually independent, and for each $i,j$, the
random variables $(A_{ij}(t) : t \ge 0)$ are independent, identically distributed, with mean $\lambda_{ij}$ and
$E[A_{ij}^2] \le K_{ij}$, for some constants $\lambda_{ij}$ and $K_{ij}$. Let $\Lambda = (\lambda_{ij} : i,j \in E)$.

• $\sigma(t)$ is the switch state used during the slot

• $L_{ij} = (-(X_{ij}(t) + A_{ij}(t) - R_{ij}(\sigma(t))))_+$, which takes value one if there was an unused potential
departure at queue $ij$ during the slot, and is zero otherwise.

The number of packets at input $i$ at the beginning of the slot is given by the row sum
$\sum_{j \in E} X_{ij}(t)$, the mean number of arrivals at input $i$ per slot is given by the row sum
$\sum_{j \in E} \lambda_{ij}$, and at most one packet at input $i$ can be served in a time slot. Similarly, the set of
packets waiting for output $j$, called the virtual queue for output $j$, has size given by the column
sum $\sum_{i \in E} X_{ij}(t)$. The mean number of arrivals to the virtual queue for output $j$ is
$\sum_{i \in E} \lambda_{ij}$, and at most one packet in the virtual queue can be served in a time slot. These
considerations lead us to impose the following restrictions on $\Lambda$:

$$\sum_{j \in E} \lambda_{ij} < 1 \ \text{ for all } i \quad \text{and} \quad \sum_{i \in E} \lambda_{ij} < 1 \ \text{ for all } j \qquad (6.26)$$
Except for trivial cases involving deterministic arrival sequences, the conditions (6.26) are necessary
for stable operation, for any choice of the switch schedule $(\sigma(t) : t \ge 0)$.

Let's first explore random, independent and identically distributed (i.i.d.) switching. That is,
given a probability distribution $u$ on $\Pi$, let $(\sigma(t) : t \ge 0)$ be independent with common probability
distribution $u$. Once the distributions of the $A_{ij}$'s and $u$ are fixed, we have a discrete-time Markov
process model. Given $\Lambda$ satisfying (6.26), we wish to determine a choice of $u$ so that the process
with i.i.d. switch selection is positive recurrent.

Some standard background from switching theory is given in this paragraph. A line sum of a
matrix $M$ is either a row sum, $\sum_j M_{ij}$, or a column sum, $\sum_i M_{ij}$. A square matrix $M$ is called
doubly stochastic if it has nonnegative entries and if all of its line sums are one. Birkhoff's theorem,
celebrated in the theory of switching, states that any doubly stochastic matrix $M$ is a convex
combination of switching matrices. That is, such an $M$ can be represented as $M = \sum_{\sigma \in \Pi} R(\sigma) u(\sigma)$,
where $u = (u(\sigma) : \sigma \in \Pi)$ is a probability distribution on $\Pi$. If $M$ is a nonnegative matrix with all
line sums less than or equal to one, then if some of the entries of $M$ are increased appropriately,
a doubly stochastic matrix can be obtained. That is, there exists a doubly stochastic matrix $\tilde M$
so that $M_{ij} \le \tilde M_{ij}$ for all $i,j$. Applying Birkhoff's theorem to $\tilde M$ yields that there is a probability
distribution $u$ so that $M_{ij} \le \sum_{\sigma \in \Pi} R_{ij}(\sigma) u(\sigma)$ for all $i,j$.
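
Birkhoff's theorem can be made concrete for small matrices. The sketch below is one simple greedy way to produce a decomposition, not necessarily the construction used in any standard proof: it searches all $N!$ permutations by brute force, so it is only sensible for small $N$. The function name and the example matrix are assumptions made for illustration, and exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction
from itertools import permutations

def birkhoff_decomposition(M):
    """Greedy decomposition of a doubly stochastic matrix into permutations.

    Returns a list of (weight, sigma), where sigma[i] is the output matched to
    input i, such that M equals the sum of weight * R(sigma).
    """
    n = len(M)
    residual = [row[:] for row in M]
    terms = []
    while any(residual[i][j] > 0 for i in range(n) for j in range(n)):
        # The residual is a multiple of a doubly stochastic matrix, so by
        # Birkhoff's theorem some permutation uses only positive entries.
        sigma = next(s for s in permutations(range(n))
                     if all(residual[i][s[i]] > 0 for i in range(n)))
        w = min(residual[i][sigma[i]] for i in range(n))
        for i in range(n):
            residual[i][sigma[i]] -= w
        terms.append((w, sigma))
    return terms

F = Fraction
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 4), F(1, 4), F(1, 2)],
     [F(1, 4), F(1, 4), F(1, 2)]]
terms = birkhoff_decomposition(M)
for w, sigma in terms:
    print(w, sigma)
```

The weights play the role of the probability distribution $u$: they are nonnegative and sum to one.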

Suppose $\Lambda$ satisfies the necessary conditions (6.26). That is, suppose that all the line sums of
$\Lambda$ are less than one. Then with $\epsilon$ defined by

$$\epsilon = \frac{1 - (\text{maximum line sum of } \Lambda)}{N},$$

each line sum of $(\lambda_{ij} + \epsilon : i,j \in E)$ is less than or equal to one. Thus, by the observation at the
end of the previous paragraph, there is a probability distribution $u^*$ on $\Pi$ so that $\lambda_{ij} + \epsilon \le \mu_{ij}(u^*)$,
where

$$\mu_{ij}(u) = \sum_{\sigma \in \Pi} R_{ij}(\sigma) u(\sigma).$$

We consider the system using probability distribution $u^*$ for the switch states. That is, let $(\sigma(t) :
t \ge 0)$ be independent, each with distribution $u^*$. Then for each $ij$, the random variables $R_{ij}(\sigma(t))$
are independent, Bernoulli($\mu_{ij}(u^*)$) random variables.

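Consider the quadratic Lyapunov function $V$ given by $V(x) = \frac{1}{2} \sum_{i,j} x_{ij}^2$. As in (6.21),

$$PV(x) - V(x) \le \sum_{ij} x_{ij} E[A_{ij}(t) - R_{ij}(\sigma(t)) \mid X(t) = x] + \frac{1}{2} \sum_{ij} E[(A_{ij}(t) - R_{ij}(\sigma(t)))^2 \mid X(t) = x].$$

Now

$$E[A_{ij}(t) - R_{ij}(\sigma(t)) \mid X(t) = x] = E[A_{ij}(t) - R_{ij}(\sigma(t))] = \lambda_{ij} - \mu_{ij}(u^*) \le -\epsilon$$

and

$$\frac{1}{2} \sum_{ij} E[(A_{ij}(t) - R_{ij}(\sigma(t)))^2 \mid X(t) = x] \le \frac{1}{2} \sum_{ij} E[(A_{ij}(t))^2 + (R_{ij}(\sigma(t)))^2] \le K$$

where $K = \frac{1}{2}\big(N + \sum_{i,j} K_{ij}\big)$. Thus,

$$PV(x) - V(x) \le -\epsilon \Big( \sum_{ij} x_{ij} \Big) + K \qquad (6.27)$$

Therefore, by Corollary 6.9.3, the process is positive recurrent, and

$$\sum_{ij} \bar X_{ij} \le \frac{K}{\epsilon} \qquad (6.28)$$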

ij 

That is, the necessary condition (6.26) is also sufficient for positive recurrence and finite mean queue 
length in equilibrium, under i.i.d. random switching, for an appropriate probability distribution u* 
on the set of permutations. 



Example 6.9.7 (An input queued switch with maximum weight switching) The random switching
policy used in Example 6.9.6 depends on the arrival rate matrix $\Lambda$, which may be unknown a priori.
Also, the policy allocates potential departures to a given queue $ij$, whether or not the queue is
empty, even if other queues could be served instead. This suggests using a dynamic switching
policy, such as the maximum weight switching policy, defined by $\sigma(t) = \sigma^{MW}(X(t))$, where for a
state $x$,

$$\sigma^{MW}(x) = \arg\max_{\sigma \in \Pi} \sum_{ij} x_{ij} R_{ij}(\sigma). \qquad (6.29)$$

The use of "argmax" here means that $\sigma^{MW}(x)$ is selected to be a value of $\sigma$ that maximizes the
sum on the right hand side of (6.29), which is the weight of permutation $\sigma$ with edge weights $x_{ij}$. In
order to obtain a particular Markov model, we assume that the set of permutations $\Pi$ is numbered
from 1 to $N!$ in some fashion, and in case there is a tie between two or more permutations for having
the maximum weight, the lowest numbered permutation is used. Let $P^{MW}$ denote the one-step
transition probability matrix when the maximum weight policy is used.

Letting $V$ and $K$ be as in Example 6.9.6, we find under the maximum weight policy that

$$P^{MW}V(x) - V(x) \le \sum_{ij} x_{ij} \big( \lambda_{ij} - R_{ij}(\sigma^{MW}(x)) \big) + K$$

The maximum of a function is greater than or equal to the average of the function, so that for any
probability distribution $u$ on $\Pi$

$$\sum_{ij} x_{ij} R_{ij}(\sigma^{MW}(x)) \ge \sum_{\sigma} u(\sigma) \sum_{ij} x_{ij} R_{ij}(\sigma) = \sum_{ij} x_{ij} \mu_{ij}(u) \qquad (6.30)$$

with equality in (6.30) if and only if $u$ is concentrated on the set of maximum weight permutations.
In particular, the choice $u = u^*$ shows that

$$\sum_{ij} x_{ij} R_{ij}(\sigma^{MW}(x)) \ge \sum_{ij} x_{ij} \mu_{ij}(u^*) \ge \sum_{ij} x_{ij} (\lambda_{ij} + \epsilon)$$

Therefore, if $P$ is replaced by $P^{MW}$, (6.27) still holds. Therefore, by Corollary 6.9.3, the process
is positive recurrent, and the same moment bound, (6.28), holds, as for the randomized switching
strategy of Example 6.9.6. On one hand, implementing the maximum weight algorithm does not
require knowledge of the arrival rates, but on the other hand, it requires that queue length
information be shared, and that a maximization problem be solved for each time slot. Much recent
work has gone towards reduced complexity dynamic switching algorithms.
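
For a small switch, the maximum weight permutation in (6.29) can be found by exhaustive search. The sketch below assumes, for concreteness, that the permutations are numbered in lexicographic order, which is just one admissible numbering; the function names and the sample state are illustrative. It also checks the inequality (6.30) for the uniform distribution on permutations.

```python
from itertools import permutations

def weight(x, sigma):
    """Weight of permutation sigma for queue lengths x: sum_i x[i][sigma[i]]."""
    return sum(x[i][sigma[i]] for i in range(len(x)))

def max_weight_permutation(x):
    """First maximum weight permutation in lexicographic order (assumed numbering)."""
    best = None
    for sigma in permutations(range(len(x))):
        if best is None or weight(x, sigma) > weight(x, best):
            best = sigma        # strict '>' keeps the lowest numbered maximizer
    return best

x = [[3, 0, 1],
     [2, 2, 0],
     [0, 5, 4]]
sigma_mw = max_weight_permutation(x)
all_weights = [weight(x, s) for s in permutations(range(3))]
average_weight = sum(all_weights) / len(all_weights)   # weight under uniform u
print(sigma_mw, weight(x, sigma_mw), average_weight)
```

Exhaustive search costs on the order of $N!$ evaluations per slot, which is one reason reduced complexity algorithms are of interest.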



6.9.2 Stability criteria for continuous time processes 

Here is a continuous time version of the Foster-Lyapunov stability criteria and the moment bounds.
Suppose $X$ is a time-homogeneous, irreducible, continuous-time Markov process with generator
matrix $Q$. The drift vector of $V(X(t))$ is the vector $QV$. This definition is motivated by the fact
that the mean drift of $V(X)$ for an interval of duration $h$ is given by

$$d_h(i) = \frac{E[V(X(t+h)) \mid X(t) = i] - V(i)}{h} = \sum_{j \in \mathcal{S}} \Big( q_{ij} + \frac{o(h)}{h} \Big) V(j), \qquad (6.31)$$

so that if the limit as $h \to 0$ can be taken inside the summation in (6.31), then $d_h(i) \to QV(i)$ as
$h \to 0$. The following useful expression for $QV$ follows from the fact that the row sums of $Q$ are
zero:

$$QV(i) = \sum_{j : j \neq i} q_{ij} (V(j) - V(i)). \qquad (6.32)$$

Formula (6.32) is quite similar to the formula (6.17) for the drift vector for a discrete-time process.

Proposition 6.9.8 (Foster-Lyapunov stability criterion-continuous time) Suppose $V : \mathcal{S} \to \mathbb{R}_+$
and $C$ is a finite subset of $\mathcal{S}$.

(a) If $QV \le 0$ on $\mathcal{S} - C$, and $\{i : V(i) \le K\}$ is finite for all $K$, then $X$ is recurrent.

(b) Suppose for some $b > 0$ and $\epsilon > 0$ that

$$QV(i) \le -\epsilon + b I_C(i) \quad \text{for all } i \in \mathcal{S}. \qquad (6.33)$$

Suppose further that $\{i : V(i) \le K\}$ is finite for all $K$, or that $X$ is nonexplosive. Then $X$ is
positive recurrent.
Figure 6.11: A system of three queues with two servers.



Example 6.9.9 Suppose $X$ has state space $\mathcal{S} = \mathbb{Z}_+$, with $q_{i0} = \mu$ for all $i \ge 1$, $q_{i,i+1} = \lambda_i$ for
all $i \ge 0$, and all other off-diagonal entries of the rate matrix $Q$ equal to zero, where $\mu > 0$ and
$\lambda_i > 0$ such that $\sum_{i \ge 0} \frac{1}{\lambda_i} < +\infty$. Let $C = \{0\}$, $V(0) = 0$, and $V(i) = 1$ for $i > 0$. Then
$QV = -\mu + (\lambda_0 + \mu) I_C$, so that (6.33) is satisfied with $\epsilon = \mu$ and $b = \lambda_0 + \mu$. However, $X$ is not
positive recurrent. In fact, $X$ is explosive. To see this, note that $p^J_{i,i+1} = \frac{\lambda_i}{\mu + \lambda_i} \ge \exp\big(-\frac{\mu}{\lambda_i}\big)$. Let
$\delta$ be the probability that, starting from state 0, the jump process does not return to zero. Then
$\delta = \prod_{i=0}^{\infty} p^J_{i,i+1} \ge \exp\big(-\mu \sum_{i=0}^{\infty} \frac{1}{\lambda_i}\big) > 0$. Thus, $X^J$ is transient. After the last visit to state zero, all
the jumps of $X^J$ are up one. The corresponding mean holding times of $X$ are $\frac{1}{\lambda_i + \mu}$, which have a
finite sum, so that the process $X$ is explosive. This example illustrates the need for the assumption
just after (6.33) in Proposition 6.9.8.
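
The quantities in this example are easy to evaluate for a specific rate sequence. The sketch below assumes $\mu = 1$ and $\lambda_i = 2^i$, a choice made here only because $\sum_i 1/\lambda_i = 2$ is finite; the names are illustrative. It evaluates the drift via (6.32), the product $\prod_i \lambda_i/(\mu + \lambda_i)$ together with its exponential lower bound, and the sum of the mean holding times, each truncated to finitely many terms.

```python
import math

MU = 1.0

def lam(i):
    """Assumed birth rates lambda_i = 2**i, so that sum_i 1/lambda_i = 2."""
    return 2.0 ** i

def V(i):
    return 0.0 if i == 0 else 1.0

def QV(i):
    """Drift QV(i) = sum over j != i of q_ij * (V(j) - V(i)), as in (6.32)."""
    drift = lam(i) * (V(i + 1) - V(i))       # jump i -> i+1 at rate lambda_i
    if i >= 1:
        drift += MU * (V(0) - V(i))          # jump i -> 0 at rate mu
    return drift

n = 60                                       # truncation level for sums and products
escape_lower = math.exp(-MU * sum(1.0 / lam(i) for i in range(n)))
escape_prod = math.prod(lam(i) / (MU + lam(i)) for i in range(n))
holding_sum = sum(1.0 / (lam(i) + MU) for i in range(n))
print(QV(0), QV(1), QV(10))
print(escape_lower, escape_prod, holding_sum)
```

The drift equals $\lambda_0$ at state 0 and $-\mu$ elsewhere, the product stays bounded away from zero, and the holding times have a finite sum, in line with the claims of the example.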



As for the case of discrete time, the drift conditions imply moment bounds. 

Proposition 6.9.10 (Moment bound-continuous time) Suppose $V$, $f$, and $g$ are nonnegative
functions on $\mathcal{S}$, and suppose $QV(i) \le -f(i) + g(i)$ for all $i \in \mathcal{S}$. In addition, suppose $X$ is positive
recurrent, so that the means, $\bar f = \pi f$ and $\bar g = \pi g$, are well-defined. Then $\bar f \le \bar g$.

Corollary 6.9.11 (Combined Foster-Lyapunov stability criterion and moment bound-continuous
time) Suppose $V$, $f$, and $g$ are nonnegative functions on $\mathcal{S}$ such that $QV(i) \le -f(i) + g(i)$ for all
$i \in \mathcal{S}$, and, for some $\epsilon > 0$, the set $C$ defined by $C = \{i : f(i) < g(i) + \epsilon\}$ is finite. Suppose also
that $\{i : V(i) \le K\}$ is finite for all $K$. Then $X$ is positive recurrent and $\bar f \le \bar g$.



Example 6.9.12 (Random server allocation with two servers) Consider the system shown in
Figure 6.11. Suppose that each queue $i$ is fed by a Poisson arrival process with rate $\lambda_i$, and suppose
there are two potential departure processes, $D_1$ and $D_2$, which are Poisson processes with rates
$m_1$ and $m_2$, respectively. The five Poisson processes are assumed to be independent. No matter
how the potential departures are allocated to the permitted queues, the following conditions are
necessary for stability:

$$\lambda_1 < m_1, \quad \lambda_3 < m_2, \quad \text{and} \quad \lambda_1 + \lambda_2 + \lambda_3 < m_1 + m_2 \qquad (6.34)$$

That is because server 1 is the only one that can serve queue 1, server 2 is the only one that can
serve queue 3, and the sum of the potential service rates must exceed the sum of the potential
arrival rates for stability. A vector $x = (x_1, x_2, x_3) \in \mathbb{Z}_+^3$ corresponds to $x_i$ packets in queue $i$ for
each $i$. Let us consider random selection, so that when $D_i$ has a jump, the queue served is chosen at
random, with the probabilities determined by $u = (u_1, u_2)$. As indicated in Figure 6.11, a potential
service by server 1 is given to queue 1 with probability $u_1$, and to queue 2 with probability $\bar u_1$.
Similarly, a potential service by server 2 is given to queue 2 with probability $u_2$, and to queue 3
with probability $\bar u_2$. The rates of potential service at the three stations are given by

$$\mu_1(u) = u_1 m_1$$

$$\mu_2(u) = \bar u_1 m_1 + u_2 m_2$$

$$\mu_3(u) = \bar u_2 m_2.$$
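Let $V(x) = \frac{1}{2}(x_1^2 + x_2^2 + x_3^2)$. Using (6.32), we find that the drift vector $QV$ is given by

$$QV(x) = \frac{1}{2} \Big( \sum_{i=1}^{3} ((x_i + 1)^2 - x_i^2) \lambda_i \Big) + \frac{1}{2} \Big( \sum_{i=1}^{3} (((x_i - 1)_+)^2 - x_i^2) \mu_i(u) \Big)$$

Now $((x_i - 1)_+)^2 \le (x_i - 1)^2$, so that

$$QV(x) \le \Big( \sum_{i=1}^{3} x_i (\lambda_i - \mu_i(u)) \Big) + \frac{\gamma}{2} \qquad (6.35)$$

where $\gamma$ is the total rate of events, given by $\gamma = \lambda_1 + \lambda_2 + \lambda_3 + \mu_1(u) + \mu_2(u) + \mu_3(u)$, or equivalently,
$\gamma = \lambda_1 + \lambda_2 + \lambda_3 + m_1 + m_2$. Suppose that the necessary condition (6.34) holds. Then there exists
some $\epsilon > 0$ and choice of $u$ so that

$$\lambda_i + \epsilon \le \mu_i(u) \quad \text{for } 1 \le i \le 3$$

and the largest such choice of $\epsilon$ is $\epsilon = \min\{m_1 - \lambda_1,\ m_2 - \lambda_3,\ \frac{m_1 + m_2 - \lambda_1 - \lambda_2 - \lambda_3}{3}\}$. (See exercise.)
So $QV(x) \le -\epsilon(x_1 + x_2 + x_3) + \frac{\gamma}{2}$ for all $x$, so Corollary 6.9.11 implies that $X$ is positive recurrent
and $\bar X_1 + \bar X_2 + \bar X_3 \le \frac{\gamma}{2\epsilon}$.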
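
The value of $\epsilon$ and one allocation $u$ that achieves it can be computed directly. The construction of $u$ in the sketch below (give queues 1 and 3 exactly $\lambda_i + \epsilon$ and leave the rest of both servers to queue 2) is an assumption made here for illustration; other choices of $u$ may also work. The rates in the example are arbitrary values satisfying (6.34).

```python
def best_epsilon(lams, m1, m2):
    """Largest eps with lam_i + eps <= mu_i(u) for some u, per the formula in the text."""
    l1, l2, l3 = lams
    return min(m1 - l1, m2 - l3, (m1 + m2 - l1 - l2 - l3) / 3)

def allocation_for(lams, m1, m2, eps):
    """One choice of u = (u1, u2) meeting lam_i + eps <= mu_i(u) (assumed construction)."""
    l1, _, l3 = lams
    u1 = (l1 + eps) / m1             # makes mu_1(u) = lam_1 + eps
    u2 = 1 - (l3 + eps) / m2         # makes mu_3(u) = lam_3 + eps
    return u1, u2

def service_rates(u, m1, m2):
    """Potential service rates (mu_1(u), mu_2(u), mu_3(u))."""
    u1, u2 = u
    return (u1 * m1, (1 - u1) * m1 + u2 * m2, (1 - u2) * m2)

lams, m1, m2 = (1.0, 1.5, 0.5), 2.0, 2.0
eps = best_epsilon(lams, m1, m2)
u = allocation_for(lams, m1, m2, eps)
mus = service_rates(u, m1, m2)
gamma = sum(lams) + m1 + m2
print(eps, u, mus, gamma / (2 * eps))    # last value: bound on the mean total queue length
```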



Example 6.9.13 (Longer first server allocation with two servers) This is a continuation of Example
6.9.12, concerned with the system shown in Figure 6.11. Examine the right hand side of (6.35).
Rather than taking a fixed value of $u$, suppose that the choice of $u$ could be specified as a function
of the state $x$. The maximum of a function is greater than or equal to the average of the function,
so that for any probability distribution $u$,

$$\begin{aligned}
\sum_{i=1}^{3} x_i \mu_i(u) &\le \max_{u'} \sum_{i} x_i \mu_i(u') \qquad (6.36) \\
&= \max_{u'} \ m_1 (x_1 u'_1 + x_2 \bar u'_1) + m_2 (x_2 u'_2 + x_3 \bar u'_2) \\
&= m_1 (x_1 \vee x_2) + m_2 (x_2 \vee x_3)
\end{aligned}$$

with equality in (6.36) for a given state $x$ if and only if a longer first policy is used: each service
opportunity is allocated to the longer queue connected to the server. Let $Q^{LF}$ denote the generator
matrix when the longer first policy is used. Then (6.35) continues to hold
for any fixed $u$, when $Q$ is replaced by $Q^{LF}$. Therefore if the necessary condition (6.34) holds, $\epsilon$
can be taken as in Example 6.9.12, and $Q^{LF}V(x) \le -\epsilon(x_1 + x_2 + x_3) + \frac{\gamma}{2}$ for all $x$. So Corollary
6.9.11 implies that $X$ is positive recurrent under the longer first policy, and $\bar X_1 + \bar X_2 + \bar X_3 \le \frac{\gamma}{2\epsilon}$.
(Note: We see that

$$Q^{LF}V(x) \le \Big( \sum_{i=1}^{3} x_i \lambda_i \Big) - m_1 (x_1 \vee x_2) - m_2 (x_2 \vee x_3) + \frac{\gamma}{2},$$

but for obtaining a bound on $\bar X_1 + \bar X_2 + \bar X_3$ it was simpler to compare to the case of random service
allocation.)
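
The inequality (6.36) can be spot-checked by enumeration. The sketch below compares the weighted service rate under every $u$ on a coarse grid with the longer first value $m_1(x_1 \vee x_2) + m_2(x_2 \vee x_3)$, over a small range of states; the grid, the range of states, and the rates are arbitrary illustrative choices.

```python
from itertools import product

def weighted_service(x, u, m1, m2):
    """sum_i x_i * mu_i(u) for the random allocation u = (u1, u2)."""
    x1, x2, x3 = x
    u1, u2 = u
    return x1 * u1 * m1 + x2 * ((1 - u1) * m1 + u2 * m2) + x3 * (1 - u2) * m2

def longer_first_service(x, m1, m2):
    """m1 * max(x1, x2) + m2 * max(x2, x3), the right hand side of (6.36)."""
    x1, x2, x3 = x
    return m1 * max(x1, x2) + m2 * max(x2, x3)

m1, m2 = 2.0, 1.5
grid = [k / 10 for k in range(11)]
worst_gap = min(
    longer_first_service(x, m1, m2) - weighted_service(x, (u1, u2), m1, m2)
    for x in product(range(5), repeat=3)
    for u1 in grid
    for u2 in grid
)
print(worst_gap)     # nonnegative (up to rounding) if (6.36) holds on the grid
```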



6.10 Problems 

6.1 Mean hitting time for a simple Markov process 

Let $(X(n) : n \ge 0)$ denote a discrete-time, time-homogeneous Markov chain with state space
$\{0, 1, 2, 3\}$ and one-step transition probability matrix

$$P = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1-a & 0 & a & 0 \\ 0 & 0.5 & 0 & 0.5 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$

for some constant $a$ with $0 \le a \le 1$. (a) Sketch the transition probability diagram for $X$ and
give the equilibrium probability vector. If the equilibrium vector is not unique, describe all the
equilibrium probability vectors.
(b) Compute $E[\min\{n \ge 1 : X(n) = 3\} \mid X(0) = 0]$.

6.2 A two station pipeline in continuous time 

This is a continuous-time version of Example 4.9.1. Consider a pipeline consisting of two
single-buffer stages in series. Model the system as a continuous-time Markov process. Suppose new packets
are offered to the first stage according to a rate $\lambda$ Poisson process. A new packet is accepted at
stage one if the buffer in stage one is empty at the time of arrival. Otherwise the new packet is
lost. If at a fixed time $t$ there is a packet in stage one and no packet in stage two, then the packet is
transferred during $[t, t+h)$ to stage two with probability $h\mu_1 + o(h)$. Similarly, if at time $t$ the second
stage has a packet, then the packet leaves the system during $[t, t+h)$ with probability $h\mu_2 + o(h)$,
independently of the state of stage one. Finally, the probability of two or more arrival, transfer, or
departure events during $[t, t+h)$ is $o(h)$. (a) What is an appropriate state-space for this model?
(b) Sketch a transition rate diagram. (c) Write down the $Q$ matrix. (d) Derive the throughput,
assuming that $\lambda = \mu_1 = \mu_2 = 1$. (e) Still assuming $\lambda = \mu_1 = \mu_2 = 1$, suppose the system starts
with one packet in each stage. What is the expected time until both buffers are empty?

6.3 Equilibrium distribution of the jump chain 

Suppose that $\pi$ is the equilibrium distribution for a time-homogeneous Markov process with
transition rate matrix $Q$. Suppose that $B^{-1} = \sum_i -q_{ii}\pi_i$, where the sum is over all $i$ in the state space,
is finite. Show that the equilibrium distribution for the jump chain $(X^J(k) : k \ge 0)$ (defined in
Section 4.10) is given by $\pi^J_i = -B q_{ii} \pi_i$. (So $\pi$ and $\pi^J$ are identical if and only if $q_{ii}$ is the same for
all $i$.)

6.4 A simple Poisson process calculation 

Let $(N(t) : t \ge 0)$ be a Poisson random process with rate $\lambda > 0$. Compute $P(N(s) = i \mid N(t) = k)$
where $0 < s < t$ and $i$ and $k$ are nonnegative integers. (Caution: note order of $s$ and $t$ carefully.)

6.5 A simple question of periods 

Consider a discrete-time Markov process with the nonzero one-step transition probabilities indi- 
cated by the following graph. 




(a) What is the period of state 4? 

(b) What is the period of state 6? 

6.6 A mean hitting time problem 

Let $(X(t) : t \ge 0)$ be a time-homogeneous, pure-jump Markov process with state space $\{0, 1, 2\}$
and $Q$ matrix

$$Q = \begin{pmatrix} -4 & 2 & 2 \\ 1 & -2 & 1 \\ 2 & 0 & -2 \end{pmatrix}$$

(a) Write down the state transition diagram and compute the equilibrium distribution.

(b) Compute $a_i = E[\min\{t \ge 0 : X(t) = 1\} \mid X(0) = i]$ for $i = 0, 1, 2$. If possible, use an approach
that can be applied to larger state spaces.

(c) Derive a variation of the Kolmogorov forward differential equations for the quantities: $\alpha_i(t) =
P(X(s) \ne 2 \text{ for } 0 \le s \le t \text{ and } X(t) = i \mid X(0) = 0)$ for $0 \le i \le 2$. (You need not solve the
equations.)

(d) The forward Kolmogorov equations describe the evolution of an initial probability distribution
going forward in time, given an initial condition. In other problems, a boundary condition is given at a
final time, and a differential equation working backwards in time from a final condition is called
for (called Kolmogorov backward equations). Derive a backward differential equation for: $\beta_j(t) =
P(X(s) \ne 2 \text{ for } t \le s \le t_f \mid X(t) = j)$, for $0 \le j \le 2$ and $t \le t_f$ for some fixed time $t_f$. (Hint:
Express $\beta_i(t - h)$ in terms of the $\beta_j(t)$'s for $t \le t_f$, and let $h \to 0$. You need not solve the equations.)

6.7 A birth-death process with periodic rates 

Consider a single server queueing system in which the number in the system is modeled as a
continuous time birth-death process with the transition rate diagram shown, where $\lambda_a$, $\lambda_b$, $\mu_a$, and
$\mu_b$ are strictly positive constants.

[Transition rate diagram: a birth-death chain on states 0, 1, 2, ..., with upward rates alternating
$\lambda_a, \lambda_b, \lambda_a, \lambda_b, \ldots$ and downward rates alternating between $\mu_a$ and $\mu_b$.]

(a) Under what additional assumptions on these four parameters is the process positive recurrent?

(b) Assuming the system is positive recurrent, under what conditions on $\lambda_a$, $\lambda_b$, $\mu_a$, and $\mu_b$ is it true
that the distribution of the number in the system at the time of a typical arrival is the same as the
equilibrium distribution of the number in the system?

6.8 Markov model for a link with resets 

Suppose that a regulated communication link resets at a sequence of times forming a Poisson
process with rate $\mu$. Packets are offered to the link according to a Poisson process with rate $\lambda$.
Suppose the link shuts down after three packets pass in the absence of resets. Once the link is
shut down, additional offered packets are dropped, until the link is reset again, at which time the
process begins anew.



(a) Sketch a transition rate diagram for a finite state Markov process describing the system state. 

(b) Express the dropping probability (same as the long term fraction of packets dropped) in terms
of $\lambda$ and $\mu$.
6.9 An unusual birth-death process

Consider the birth-death process $X$ with arrival rates $\lambda_k = (p/(1-p))^k / a_k$ and death rates $\mu_k =
(p/(1-p))^{k-1} / a_k$, where $0.5 < p < 1$, and $a = (a_0, a_1, \ldots)$ is a probability distribution on the
nonnegative integers with $a_k > 0$ for all $k$. (a) Classify the states for the process $X$ as transient,
null recurrent or positive recurrent. (b) Check that $aQ = 0$. Is $a$ an equilibrium distribution for
$X$? Explain. (c) Find the one-step transition probabilities for the jump-chain, $X^J$. (d) Classify the
states for the process $X^J$ as transient, null recurrent or positive recurrent.

6.10 A queue with decreasing service rate 

Consider a queueing system in which the arrival process is a Poisson process with rate $\lambda$. Suppose
the instantaneous completion rate is $\mu$ when there are $K$ or fewer customers in the system, and $\mu/2$
when there are $K + 1$ or more customers in the system. The number in the system is modeled as a
birth-death Markov process. (a) Sketch the transition rate diagram. (b) Under what condition on
$\lambda$ and $\mu$ are all states positive recurrent? Under this condition, give the equilibrium distribution.
(c) Suppose that $\lambda = (2/3)\mu$. Describe in words the typical behavior of the system, given that it is
initially empty.

6.11 Limit of a discrete time queueing system

Model a queue by a discrete-time Markov chain by recording the queue state after intervals of $q$
seconds each. Assume the queue evolves during one of the atomic intervals as follows: There is an
arrival during the interval with probability $\alpha q$, and no arrival otherwise. If there is a customer in
the queue at the beginning of the interval then a single departure will occur during the interval
with probability $\beta q$. Otherwise no departure occurs. Suppose that it is impossible to have an
arrival and a departure in a single atomic interval. (a) Find $a_k = P$(an interarrival time is $kq$) and
$b_k = P$(a service time is $kq$). (b) Find the equilibrium distribution, $p = (p_k : k \ge 0)$, of the number
of customers in the system at the end of an atomic interval. What happens as $q \to 0$?

6.12 An M/M/1 queue with impatient customers

Consider an M/M/1 queue with parameters $\lambda$ and $\mu$ with the following modification. Each customer
in the queue will defect (i.e. depart without service) with probability $\alpha h + o(h)$ in an interval of
length $h$, independently of the other customers in the queue. Once a customer makes it to the
server it no longer has a chance to defect and simply waits until its service is completed and then
departs from the system. Let $N(t)$ denote the number of customers in the system (queue plus
server) at time $t$. (a) Give the transition rate diagram and generator matrix $Q$ for the Markov
chain $N = (N(t) : t \ge 0)$. (b) Under what conditions are all states positive recurrent? Under this
condition, find the equilibrium distribution for $N$. (You need not explicitly sum the series.) (c)
Suppose that $\alpha = \mu$. Find an explicit expression for $p_D$, the probability that a typical arriving
customer defects instead of being served. Does your answer make sense as $\lambda/\mu$ converges to zero
or to infinity?

6.13 Statistical multiplexing 

Consider the following scenario regarding a one-way link in a store-and-forward packet
communication network. Suppose that the link supports eight connections, each generating traffic at 5
kilobits per second (kbps). The data for each connection is assumed to be in packets exponentially
distributed in length with mean packet size 1 kilobit. The packet lengths are assumed mutually
independent and the packets for each stream arrive according to a Poisson process. Packets are
queued at the beginning of the link if necessary, and queue space is unlimited. Compute the mean
delay (queueing plus transmission time; neglect propagation delay) for each of the following three
scenarios. Compare your answers. (a) (Full multiplexing) The link transmit speed is 50 kbps. (b)
The link is replaced by two 25 kbps links, and each of the two links carries four sessions. (Of course
the delay would be larger if the sessions were not evenly divided.) (c) (Multiplexing over two links)
The link is replaced by two 25 kbps links. Each packet is transmitted on one link or the other, and
neither link is idle whenever a packet from any session is waiting.

6.14 A queue with blocking 

(M/M/1/5 system) Consider an M/M/1 queue with service rate $\mu$, arrival rate $\lambda$, and the
modification that at any time, at most five customers can be in the system (including the one in service,
if any). If a customer arrives and the system is full (i.e. already has five customers in it) then the
customer is dropped, and is said to be blocked. Let $N(t)$ denote the number of customers in the
system at time $t$. Then $(N(t) : t \geq 0)$ is a Markov chain. (a) Indicate the transition rate diagram of
the chain and find the equilibrium probability distribution. (b) What is the probability, $p_B$, that a
typical customer is blocked? (c) What is the mean waiting time in queue, $W$, of a typical customer
that is not blocked? (d) Give a simple method to numerically calculate, or give a simple expression
for, the mean length of a busy period of the system. (A busy period begins with the arrival of a
customer to an empty system and ends when the system is again empty.)

6.15 Three queues and an autonomously traveling server 

Consider three stations that are served by a single rotating server, as pictured. 

[Figure: stations 1, 2, and 3, each with its own arrival stream, visited by a single rotating server.]

Customers arrive to station $i$ according to a Poisson process of rate $\lambda_i$ for $1 \leq i \leq 3$, and the total
service requirement of each customer is exponentially distributed, with mean one. The rotation
of the server is modelled by a three state Markov process with the transition rates $\alpha$, $\beta$, and $\gamma$ as
indicated by the dashed lines. When at a station, the server works at unit rate, or is idle if the
station is empty. If the service to a customer is interrupted because the server moves to the next
station, the service is resumed when the server returns.



(a) Under what condition is the system stable? Briefly justify your answer. 

(b) Identify a method for computing the mean customer waiting time at station one. 

6.16 On two distributions seen by customers

Consider a queueing system in which the number in the system only changes in steps of plus one
or minus one. Let $D(k, t)$ denote the number of customers that depart in the interval $[0, t]$ that
leave behind exactly $k$ customers, and let $R(k, t)$ denote the number of customers that arrive in the
interval $[0, t]$ to find exactly $k$ customers already in the system. (a) Show that $|D(k, t) - R(k, t)| \leq 1$
for all $k$ and $t$. (b) Let $\alpha_t$ (respectively $\delta_t$) denote the number of arrivals (departures) up to time
$t$. Suppose that $\alpha_t \to \infty$ and $\alpha_t/\delta_t \to 1$ as $t \to \infty$. Show that if the following two limits exist for
a given value $k$, then they are equal: $r_k = \lim_{t \to \infty} R(k, t)/\alpha_t$ and $d_k = \lim_{t \to \infty} D(k, t)/\delta_t$.

6.17 Recurrence of mean zero random walks 

Suppose $B_1, B_2, \ldots$ is a sequence of independent, mean zero, integer valued random variables,
which are bounded, i.e. $P\{|B_i| \leq M\} = 1$ for some $M$.

(a) Let $X_0 = 0$ and $X_n = B_1 + \cdots + B_n$ for $n > 0$. Show that $X$ is recurrent.

(b) Suppose $Y_0 = 0$ and $Y_{n+1} = Y_n + B_n + L_n$, where $L_n = (-(Y_n + B_n))_+$. The process $Y$ is a
reflected version of $X$. Show that $Y$ is recurrent.

6.18 Positive recurrence of reflected random walk with negative drift 

Suppose $B_1, B_2, \ldots$ is a sequence of independent, integer valued random variables, each with mean
$\overline{B} < 0$ and second moment $\overline{B^2} < +\infty$. Suppose $X_0 = 0$ and $X_{n+1} = X_n + B_n + L_n$, where
$L_n = (-(X_n + B_n))_+$. Show that $X$ is positive recurrent, and give an upper bound on the mean
under the equilibrium distribution, $\overline{X}$. (Note, it is not assumed that the $B$'s are bounded.)

6.19 Routing with two arrival streams 

(a) Generalize Example 6.9.4 to the scenario shown. 
where $a_i, d_j \in (0, 1)$ for $1 \leq i \leq 2$ and $1 \leq j \leq 3$. In particular, determine conditions on $a_1$ and
$a_2$ that ensure there is a choice of $u = (u_1, u_2)$ which makes the system positive recurrent. Under
those conditions, find an upper bound on $\overline{X}_1 + \overline{X}_2 + \overline{X}_3$, and select $u$ to minimize the bound.
(b) Generalize Example 1.b to the scenario shown. In particular, can you find a version of
route-to-shorter routing so that the bound found in part (a) still holds?

6.20 An inadequacy of a linear potential function

Consider the system of Example 6.9.5 (a discrete-time model, using the route to shorter policy,
with ties broken in favor of queue 1, so $u = I_{\{x_1 \leq x_2\}}$):

[Figure: arrivals routed to queue 1 or queue 2.]

Assume $a = 0.7$ and $d_1 = d_2 = 0.4$. The system is positive recurrent. Explain why the function
$V(x) = x_1 + x_2$ does not satisfy the Foster-Lyapunov stability criteria for positive recurrence, for
any choice of the constant $b$ and the finite set $C$.

6.21 Allocation of service 

Prove the claim in Example 6.9.12 about the largest value of $\epsilon$.

6.22 Opportunistic scheduling 

(Based on [14]) Suppose $N$ queues are in parallel, and suppose the arrivals to a queue $i$ form an
independent, identically distributed sequence, with the number of arrivals in a given slot having
mean $a_i > 0$ and finite second moment $K_i$. For each $t \geq 0$, let $S(t)$ be a subset of $E = \{1, \ldots, N\}$.
The random sets $(S(t) : t \geq 0)$ are assumed to be independent with common distribution $w$.
The interpretation is that there is a single server, and in slot $t$, it can serve one packet from one
of the queues in $S(t)$. For example, the queues might be in the base station of a wireless network
with packets queued for $N$ mobile users, and $S(t)$ denotes the set of mobile users that have working
channels for time slot $[t, t + 1)$. See the illustration:




(a) Explain why the following condition is necessary for stability: For all $s \subset E$ with $s \neq \emptyset$,



(6.37) 



(b) Consider $u$ of the form $u = (u(i, s) : i \in E, s \subset E)$, with $u(i, s) \geq 0$, $u(i, s) = 0$ if $i \notin s$, and
$\sum_{i \in E} u(i, s) = I_{\{s \neq \emptyset\}}$. Suppose that given $S(t) = s$, the queue that is given a potential service
opportunity has probability distribution $(u(i, s) : i \in E)$. Then the probability of a potential service
at queue $i$ is given by $\mu_i(u) = \sum_s u(i, s) w(s)$ for $i \in E$. Show that under the condition (6.37), for
some $\epsilon > 0$, $u$ can be selected so that $a_i + \epsilon \leq \mu_i(u)$ for $i \in E$. (Hint: Apply the min-cut, max-flow
theorem to an appropriate graph.)

(c) Show that using the u found in part (b) that the process is positive recurrent. 

(d) Suggest a dynamic scheduling method which does not require knowledge of the arrival rates or 
the distribution w, which yields the same bound on the mean sum of queue lengths found in part 
(b). 

6.23 Routing to two queues — continuous time model 

Give a continuous time analog of Examples 6.9.4 and 6.9.5. In particular, suppose that the arrival
process is Poisson with rate $\lambda$ and the potential departure processes are Poisson with rates $\mu_1$ and
$\mu_2$.

6.24 Stability of two queues with transfers 

Let $(\lambda_1, \lambda_2, \nu, \mu_1, \mu_2)$ be a vector of strictly positive parameters, and consider a system of two service
stations with transfers as pictured.



Station $i$ has Poisson arrivals at rate $\lambda_i$ and an exponential type server, with rate $\mu_i$. In addition,
customers are transferred from station 1 to station 2 at rate $u\nu$, where $u$ is a constant with
$u \in U = [0, 1]$. (Rather than applying dynamic programming here, we will apply the method of
Foster-Lyapunov stability theory in continuous time.) The system is described by a continuous-time
Markov process on $\mathbb{Z}_+^2$ with some transition rate matrix $Q$. (You don't need to write out $Q$.)

(a) Under what condition on $(\lambda_1, \lambda_2, \nu, \mu_1, \mu_2)$ is there a choice of the constant $u$ such that the
Markov process describing the system is positive recurrent?

(b) Let $V$ be the quadratic Lyapunov function, $V(x_1, x_2) = \frac{x_1^2}{2} + \frac{x_2^2}{2}$. Compute the drift vector $QV$.

(c) Under the condition of part (a), and using the moment bound associated with the
Foster-Lyapunov criteria, find an upper bound on the mean number in the system in equilibrium, $\overline{X}_1 + \overline{X}_2$.
(The smaller the bound the better.)

6.25 Stability of a system with two queues and modulated server 

Consider two queues, queue 1 and queue 2, such that in each time slot, queue $i$ receives a new
packet with probability $a_i$, where $0 < a_1 < 1$ and $0 < a_2 < 1$. Suppose the server is described by a
three state Markov process, with transition probabilities depending on a constant $b$, with $0 < b < \frac{1}{2}$,
as shown.

[Figure: queue 1, queue 2, and the three-state server process; the diagram labels include "longer" and the transition probability $2b$.]


If the server process is in state $i$ for $i \in \{1, 2\}$ at the beginning of a slot, then a potential service
is given to station $i$. If the server process is in state 0 at the beginning of a slot, then a potential
service is given to the longer queue (with ties broken in favor of queue 1). Then during the slot,
the server state jumps with probability $2b$. (Note that a packet can arrive and depart in one time
slot.) For what values of $a_1$ and $a_2$ is the process stable? Briefly explain your answer (but rigorous
proof is not required).





Chapter 7 

Basic Calculus of Random Processes 



The calculus of deterministic functions revolves around continuous functions, derivatives, and inte- 
grals. These concepts all involve the notion of limits. See the appendix for a review of continuity, 
differentiation and integration. In this chapter the same concepts are treated for random processes. 
We've seen four different senses in which a sequence of random variables can converge: almost 
surely (a.s.), in probability (p.), in mean square (m.s.), and in distribution (d.). Of these senses, 
we will use the mean square sense of convergence the most, and make use of the correlation version 
of the Cauchy criterion for m.s. convergence, and the associated facts that for m.s. convergence, 
the means of the limits are the limits of the means, and correlations of the limits are the limits of 
correlations (Proposition 2.2.3 and Corollaries 2.2.4 and 2.2.5). As an application of integration 
of random processes, ergodicity and the Karhunen-Loève expansion are discussed. In addition,
notation for complex-valued random processes is introduced. 

7.1 Continuity of random processes 

The topic of this section is the definition of continuity of a continuous-time random process, with
a focus on continuity defined using m.s. convergence. Chapter 2 covers convergence of sequences.
Limits for deterministic functions of a continuous variable can be defined in either of two equivalent
ways. Specifically, a function $f$ on $\mathbb{R}$ has a limit $y$ at $t_o$, written as $\lim_{s \to t_o} f(s) = y$, if either of
the two equivalent conditions is true:

(1) (Definition based on $\epsilon$ and $\delta$) Given $\epsilon > 0$, there exists $\delta > 0$ so that $|f(s) - y| \leq \epsilon$ whenever
$|s - t_o| \leq \delta$.

(2) (Definition based on sequences) $f(s_n) \to y$ for any sequence $(s_n)$ such that $s_n \to t_o$.

Let's check that (1) and (2) are equivalent. Suppose (1) is true, and let $(s_n)$ be such that $s_n \to t_o$.
Let $\epsilon > 0$ and then let $\delta$ be as in condition (1). Since $s_n \to t_o$, it follows that there exists $n_o$ so
that $|s_n - t_o| \leq \delta$ for all $n \geq n_o$. But then $|f(s_n) - y| \leq \epsilon$ by the choice of $\delta$. Thus, $f(s_n) \to y$.
That is, (1) implies (2).

For the converse direction, it suffices to prove the contrapositive: if (1) is not true then (2) is
not true. Suppose (1) is not true. Then there exists an $\epsilon > 0$ so that, for any $n \geq 1$, there exists a
value $s_n$ with $|s_n - t_o| \leq \frac{1}{n}$ such that $|f(s_n) - y| > \epsilon$. But then $s_n \to t_o$, and yet $f(s_n) \not\to y$, so (2)
is false. That is, not (1) implies not (2). This completes the proof that (1) and (2) are equivalent.
Similarly, and by essentially the same reasons, convergence for a continuous-time random process
can be defined using either $\epsilon$ and $\delta$, or using sequences, at least for limits in the p., m.s., or d.
senses. As we will see, the situation is slightly different for a.s. limits. Let $X = (X_t : t \in \mathbb{T})$ be a
random process such that the index set $\mathbb{T}$ is equal to either all of $\mathbb{R}$, or an interval in $\mathbb{R}$, and fix
$t_o \in \mathbb{T}$.

Definition 7.1.1 (Limits for continuous-time random processes.) The process $(X_t : t \in \mathbb{T})$ has
limit $Y$ at $t_o$:

(i) in the m.s. sense, written $\lim_{s \to t_o} X_s = Y$ m.s., if for any $\epsilon > 0$, there exists $\delta > 0$ so that
$E[(X_s - Y)^2] < \epsilon$ whenever $s \in \mathbb{T}$ and $|s - t_o| < \delta$. An equivalent condition is $X_{s_n} \xrightarrow{m.s.} Y$ as
$n \to \infty$, whenever $s_n \to t_o$.

(ii) in probability, written $\lim_{s \to t_o} X_s = Y$ p., if given any $\epsilon > 0$, there exists $\delta > 0$ so that
$P\{|X_s - Y| \geq \epsilon\} < \epsilon$ whenever $s \in \mathbb{T}$ and $|s - t_o| < \delta$. An equivalent condition is $X_{s_n} \xrightarrow{p.} Y$ as
$n \to \infty$, whenever $s_n \to t_o$.

(iii) in distribution, written $\lim_{s \to t_o} X_s = Y$ d., if given any continuity point $c$ of $F_Y$ and any $\epsilon > 0$,
there exists $\delta > 0$ so that $|F_{X,1}(c, s) - F_Y(c)| < \epsilon$ whenever $s \in \mathbb{T}$ and $|s - t_o| < \delta$. An equivalent
condition is $X_{s_n} \xrightarrow{d.} Y$ as $n \to \infty$, whenever $s_n \to t_o$. (Recall that $F_{X,1}(c, s) = P\{X_s \leq c\}$.)

(iv) almost surely, written $\lim_{s \to t_o} X_s = Y$ a.s., if there is an event $F_{t_o}$ having probability one such
that $F_{t_o} \subset \{\omega : \lim_{s \to t_o} X_s(\omega) = Y(\omega)\}$.$^1$

The relationship among the above four types of convergence in continuous time is the same as 
the relationship among the four types of convergence of sequences, illustrated in Figure 2.8. That 
is, the following is true: 

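Proposition 7.1.2 The following statements hold as $s \to t_o$ for a fixed $t_o$ in $\mathbb{T}$: If either $X_s \xrightarrow{a.s.} Y$
or $X_s \xrightarrow{m.s.} Y$ then $X_s \xrightarrow{p.} Y$. If $X_s \xrightarrow{p.} Y$ then $X_s \xrightarrow{d.} Y$. Also, if there is a random variable $Z$ with
$E[Z^2] < \infty$ and $|X_t| \leq Z$ for all $t$, and if $X_s \xrightarrow{p.} Y$ then $X_s \xrightarrow{m.s.} Y$.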

Proof. As indicated in Definition 7.1.1, the first three types of convergence are equivalent to
convergence along sequences, in the corresponding senses. The fourth type of convergence, namely
a.s. convergence as $s \to t_o$, implies convergence along sequences (Example 7.1.3 shows that the
converse is not true). That is true because if $(s_n)$ is a sequence converging to $t_o$,

$$\{\omega : \lim_{s \to t_o} X_s(\omega) = Y(\omega)\} \subset \{\omega : \lim_{n \to \infty} X_{s_n}(\omega) = Y(\omega)\}.$$

Therefore, if the first of these sets contains an event which has probability one, the second of these
sets is an event which has probability one. The proposition then follows from the same relations
for convergence of sequences. In particular, a.s. convergence for continuous time implies a.s.
convergence along sequences (as just shown), which implies convergence in p. along sequences, which
is the same as convergence in probability. The other implications of the proposition follow directly
from the same implications for sequences, and the fact that the first three definitions of convergence for
continuous time have a form based on sequences. □

$^1$ This definition is complicated by the fact that the set $\{\omega : \lim_{s \to t_o} X_s(\omega) = Y(\omega)\}$ involves uncountably many
random variables, and it is not necessarily an event. There is a way to simplify the definition as follows, but it
requires an extra assumption. A probability space $(\Omega, \mathcal{F}, P)$ is complete, if whenever $N$ is an event having probability
zero, all subsets of $N$ are events. If $(\Omega, \mathcal{F}, P)$ is complete, the definition of $\lim_{s \to t_o} X_s = Y$ a.s. is equivalent to the
requirement that $\{\omega : \lim_{s \to t_o} X_s(\omega) = Y(\omega)\}$ be an event and have probability one.

The following example shows that a.s. convergence as $s \to t_o$ is strictly stronger than a.s.
convergence along sequences.

Example 7.1.3 Let $U$ be uniformly distributed on the interval $[0, 1]$. Let $X_t = 1$ if $t - U$ is a
rational number, and $X_t = 0$ otherwise. Each sample path of $X$ takes values zero and one in any
finite interval, so that $X$ is not a.s. convergent at any $t_o$. However, for any fixed $t$, $P\{X_t = 0\} = 1$.
Therefore, for any sequence $s_n$, since there are only countably many terms, $P\{X_{s_n} = 0 \text{ for all } n\} = 1$
so that $X_{s_n} \to 0$ a.s.



Definition 7.1.4 (Four types of continuity at a point for a random process) For each $t_o \in \mathbb{T}$ fixed,
the random process $X = (X_t : t \in \mathbb{T})$ is continuous at $t_o$ in any one of the four senses: m.s., p.,
a.s., or d., if $\lim_{s \to t_o} X_s = X_{t_o}$ in the corresponding sense.

The following is immediately implied by Proposition 7.1.2. It shows that for convergence of a 
random process at a single point, the relations illustrated in Figure 2.8 again hold. 

Corollary 7.1.5 If $X$ is continuous at $t_o$ in either the a.s. or m.s. sense, then $X$ is continuous at
$t_o$ in probability. If $X$ is continuous at $t_o$ in probability, then $X$ is continuous at $t_o$ in distribution.
Also, if there is a random variable $Z$ with $E[Z^2] < \infty$ and $|X_t| \leq Z$ for all $t$, and if $X$ is continuous
at $t_o$ in probability, then it is continuous at $t_o$ in the m.s. sense.

A deterministic function $f$ on $\mathbb{R}$ is simply called continuous if it is continuous at all points. Since
we have four senses of continuity at a point for a random process, this gives four types of continuity
for random processes. Before stating them formally, we describe a fifth type of continuity of random
processes, which is often used in applications. Recall that for a fixed $\omega \in \Omega$, the random process
$X$ gives a sample path, which is a function on $\mathbb{T}$. Continuity of a sample path is thus defined as
it is for any deterministic function. The subset of $\Omega$, $\{\omega : X_t(\omega)$ is a continuous function of $t\}$, or
more concisely, $\{X_t$ is a continuous function of $t\}$, is the set of $\omega$ such that the sample path for $\omega$
is continuous. The fifth type of continuity requires that the sample paths be continuous, if a set
of probability zero is ignored.

Definition 7.1.6 (Five types of continuity for a whole random process) A random process
$X = (X_t : t \in \mathbb{T})$ is said to be

m.s. continuous if it is m.s. continuous at each $t$

continuous in p. if it is continuous in p. at each $t$

continuous in d. if it is continuous in d. at each $t$

a.s. continuous at each $t$, if, as the phrase indicates, it is a.s. continuous at each $t$.$^2$

a.s. sample-path continuous, if $F \subset \{X_t$ is continuous in $t\}$ for some event $F$ with $P(F) = 1$.

The relationship among the five types of continuity for a whole random process is pictured in 
Figure 7.1, and is summarized in the following proposition. 

[Figure 7.1 diagram: a.s. sample-path continuous implies a.s. continuous at each $t$, which implies p., which implies d.; m.s. also implies p.]

Figure 7.1: Relationships among five types of continuity of random processes.

Proposition 7.1.7 If a process is a.s. sample-path continuous it is a.s. continuous at each $t$. If
a process is a.s. continuous at each $t$ or m.s. continuous, it is continuous in p. If a process is
continuous in p. it is continuous in d. Also, if there is a random variable $Y$ with $E[Y^2] < \infty$ and
$|X_t| \leq Y$ for all $t$, and if $X$ is continuous in p., then $X$ is m.s. continuous.

Proof. Suppose $X$ is a.s. sample-path continuous. Then for any $t_o \in \mathbb{T}$,

$$\{\omega : X_t(\omega) \text{ is continuous at all } t \in \mathbb{T}\} \subset \{\omega : X_t(\omega) \text{ is continuous at } t_o\} \tag{7.1}$$

Since $X$ is a.s. sample-path continuous, the set on the left-hand side of (7.1) contains an event $F$
with $P(F) = 1$ and $F$ is also a subset of the set on the right-hand side of (7.1). Thus, $X$ is
a.s. continuous at $t_o$. Since $t_o$ was an arbitrary element of $\mathbb{T}$, it follows that $X$ is a.s. continuous at
each $t$. The remaining implications of the proposition follow from Corollary 7.1.5. □



$^2$ We avoid using the terminology "a.s. continuous" for the whole random process, because such terminology could
too easily be confused with a.s. sample-path continuous.

Example 7.1.8 (Shows a.s. sample-path continuity is strictly stronger than a.s. continuity at each
$t$.) Let $X = (X_t : 0 \leq t \leq 1)$ be given by $X_t = I_{\{t \geq U\}}$ for $0 \leq t \leq 1$, where $U$ is uniformly distributed
over $[0, 1]$. Thus, each sample path of $X$ has a single upward jump of size one, at a random time $U$
uniformly distributed over $[0, 1]$. So every sample path is discontinuous, and therefore $X$ is not a.s.
sample-path continuous. For any fixed $t$ and $\omega$, if $U(\omega) \neq t$ (i.e. if the jump of $X$ is not exactly at
time $t$) then $X_s(\omega) \to X_t(\omega)$ as $s \to t$. Since $P\{U \neq t\} = 1$, it follows that $X$ is a.s. continuous at
each $t$. Therefore $X$ is also continuous in the p. and d. senses. Finally, since $|X_t| \leq 1$ for all $t$ and $X$ is
continuous in p., it is also m.s. continuous.
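
As a rough numerical check of the last claim in Example 7.1.8, the following sketch estimates $E[(X_s - X_t)^2]$ by Monte Carlo for the single-jump process. For this process the quantity equals the probability that $U$ falls between $s$ and $t$, namely $|s - t|$, which tends to zero even though every sample path has a jump. The function name `ms_increment`, the sample size, and the seed are illustrative assumptions, not part of the notes.

```python
import random

def ms_increment(s, t, num_samples=200_000, seed=0):
    """Monte Carlo estimate of E[(X_s - X_t)^2] for X_t = I{t >= U}, U ~ Uniform[0, 1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        u = rng.random()                   # one draw of the jump time U
        x_s = 1.0 if s >= u else 0.0       # value of the sample path at time s
        x_t = 1.0 if t >= u else 0.0       # value of the sample path at time t
        total += (x_s - x_t) ** 2
    return total / num_samples

t = 0.5
for h in (0.1, 0.01, 0.001):
    # Each estimate should be close to h, the exact value of E[(X_{t+h} - X_t)^2].
    print(h, ms_increment(t + h, t))
```

The estimates shrink in proportion to $h$, which is the m.s. continuity at $t$ asserted in the example.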

The remainder of this section focuses on m.s. continuity. Recall that the definition of m.s.
convergence of a sequence of random variables requires that the random variables have finite second
moments, and consequently the limit also has a finite second moment. Thus, in order for a random
process $X = (X_t : t \in \mathbb{T})$ to be continuous in the m.s. sense, it must be a second order process:
$E[X_t^2] < \infty$ for all $t \in \mathbb{T}$. Whether $X$ is m.s. continuous depends only on the correlation function
$R_X$, as shown in the following proposition.

Proposition 7.1.9 Suppose $(X_t : t \in \mathbb{T})$ is a second order process. The following are equivalent:

(i) $R_X$ is continuous at all points of the form $(t, t)$ (This condition involves $R_X$ for points in and
near the set of points of the form $(t, t)$. It is stronger than requiring $R_X(t, t)$ to be continuous
in $t$; see Example 7.1.10.)

(ii) $X$ is m.s. continuous

(iii) $R_X$ is continuous over $\mathbb{T} \times \mathbb{T}$.

If $X$ is m.s. continuous, then the mean function, $\mu_X(t)$, is continuous. If $X$ is wide sense stationary,
the following are equivalent:

(i') $R_X(\tau)$ is continuous at $\tau = 0$

(ii') $X$ is m.s. continuous

(iii') $R_X(\tau)$ is continuous over all of $\mathbb{R}$.

Proof. ((i) implies (ii)) Fix $t \in \mathbb{T}$ and suppose that $R_X$ is continuous at the point $(t, t)$. Then
$R_X(s, s)$, $R_X(s, t)$, and $R_X(t, s)$ all converge to $R_X(t, t)$ as $s \to t$. Therefore, $\lim_{s \to t} E[(X_s - X_t)^2] =
\lim_{s \to t} (R_X(s, s) - R_X(s, t) - R_X(t, s) + R_X(t, t)) = 0$. So $X$ is m.s. continuous at $t$. Therefore if
$R_X$ is continuous at all points of the form $(t, t) \in \mathbb{T} \times \mathbb{T}$, then $X$ is m.s. continuous at all $t \in \mathbb{T}$.
Therefore (i) implies (ii).

((ii) implies (iii)) Suppose condition (ii) is true. Let $(s, t) \in \mathbb{T} \times \mathbb{T}$, and suppose $(s_n, t_n) \in \mathbb{T} \times \mathbb{T}$
for all $n \geq 1$ such that $\lim_{n \to \infty} (s_n, t_n) = (s, t)$. Therefore, $s_n \to s$ and $t_n \to t$ as $n \to \infty$. By
condition (ii), it follows that $X_{s_n} \xrightarrow{m.s.} X_s$ and $X_{t_n} \xrightarrow{m.s.} X_t$ as $n \to \infty$. Since the limit of the
correlations is the correlation of the limit for a pair of m.s. convergent sequences (Corollary 2.2.4)
it follows that $R_X(s_n, t_n) \to R_X(s, t)$ as $n \to \infty$. Thus, $R_X$ is continuous at $(s, t)$, where $(s, t)$ was
an arbitrary point of $\mathbb{T} \times \mathbb{T}$. Therefore $R_X$ is continuous over $\mathbb{T} \times \mathbb{T}$, proving that (ii) implies (iii).

Obviously (iii) implies (i), so the proof of the equivalence of (i)-(iii) is complete.

If $X$ is m.s. continuous, then, by definition, for any $t \in \mathbb{T}$, $X_s \xrightarrow{m.s.} X_t$ as $s \to t$. It thus follows
that $\mu_X(s) \to \mu_X(t)$, because the limit of the means is the mean of the limit, for a m.s. convergent
sequence (Corollary 2.2.5). Thus, m.s. continuity of $X$ implies that the deterministic mean function,
$\mu_X$, is continuous.

Finally, if $X$ is WSS, then $R_X(s, t) = R_X(\tau)$ where $\tau = s - t$, and the three conditions (i)-(iii)
become (i')-(iii'), so the equivalence of (i)-(iii) implies the equivalence of (i')-(iii'). □



Example 7.1.10 Let $X = (X_t : t \in \mathbb{R})$ be defined by $X_t = U$ for $t < 0$ and $X_t = V$ for $t \geq 0$,
where $U$ and $V$ are independent random variables with mean zero and variance one. Let $t_n$ be a
sequence of strictly negative numbers converging to 0. Then $X_{t_n} = U$ for all $n$ and $X_0 = V$. Since
$P\{|U - V| \geq \epsilon\} \neq 0$ for $\epsilon$ small enough, $X_{t_n}$ does not converge to $X_0$ in the p. sense. So $X$ is not
continuous in probability at zero. It is thus not continuous in the m.s. or a.s. sense at zero either. The
only one of the five senses in which the whole process could be continuous is continuity in distribution.
The process $X$ is continuous in distribution if and only if $U$ and $V$ have the same distribution.
Finally, let us check the continuity properties of the autocorrelation function. The autocorrelation
function is given by $R_X(s, t) = 1$ if either $s, t < 0$ or if $s, t \geq 0$, and $R_X(s, t) = 0$ otherwise. So $R_X$
is not continuous at $(0, 0)$, because $R_X(\frac{1}{n}, -\frac{1}{n}) = 0$ for all $n \geq 1$, so $R_X(\frac{1}{n}, -\frac{1}{n}) \not\to R_X(0, 0) = 1$ as
$n \to \infty$. However, it is true that $R_X(t, t) = 1$ for all $t$, so that $R_X(t, t)$ is a continuous function of
$t$. This illustrates the fact that continuity of the function of two variables, $R_X(s, t)$, at a particular
fixed point $(t_o, t_o)$, is a stronger requirement than continuity of the function of one variable, $R_X(t, t)$,
at $t = t_o$.
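
The distinction drawn in Example 7.1.10 can be made concrete with a short sketch that evaluates the autocorrelation function of the example along the diagonal and along the sequence of points $(\frac{1}{n}, -\frac{1}{n})$. The function name `autocorrelation` is an illustrative choice; the formula inside it is the one stated in the example.

```python
def autocorrelation(s, t):
    """R_X(s, t) for Example 7.1.10: 1 if s and t lie on the same side of zero, else 0."""
    same_side = (s < 0 and t < 0) or (s >= 0 and t >= 0)
    return 1.0 if same_side else 0.0

# Along the diagonal the function is constant, hence continuous in t.
diagonal = [autocorrelation(t, t) for t in (-1.0, -0.01, 0.0, 0.01, 1.0)]

# Along (1/n, -1/n) the points approach (0, 0) but the values stay at 0,
# while R_X(0, 0) = 1, so R_X is not continuous at (0, 0).
off_diagonal = [autocorrelation(1.0 / n, -1.0 / n) for n in (1, 10, 100, 1000)]

print(diagonal)      # [1.0, 1.0, 1.0, 1.0, 1.0]
print(off_diagonal)  # [0.0, 0.0, 0.0, 0.0]
```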



Example 7.1.11 Let $W = (W_t : t \geq 0)$ be a Brownian motion with parameter $\sigma^2$. Then
$E[(W_t - W_s)^2] = \sigma^2 |t - s| \to 0$ as $s \to t$. Therefore $W$ is m.s. continuous. Another way to show $W$ is
m.s. continuous is to observe that the autocorrelation function, $R_W(s, t) = \sigma^2 (s \wedge t)$, is continuous.
Since $W$ is m.s. continuous, it is also continuous in the p. and d. senses. As we stated in defining
$W$, it is a.s. sample-path continuous, and therefore a.s. continuous at each $t \geq 0$, as well.
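
The two routes to m.s. continuity in Example 7.1.11 are linked by the identity used in the proof of Proposition 7.1.9, $E[(X_s - X_t)^2] = R_X(s, s) - R_X(s, t) - R_X(t, s) + R_X(t, t)$. The sketch below applies that identity to $R_W(s, t) = \sigma^2 (s \wedge t)$ and recovers $\sigma^2 |t - s|$. The helper names and the value $\sigma^2 = 2$ are assumptions made only for illustration.

```python
def ms_distance(corr, s, t):
    """E[(X_s - X_t)^2] computed from an autocorrelation function corr(s, t)."""
    return corr(s, s) - corr(s, t) - corr(t, s) + corr(t, t)

def brownian_corr(s, t, sigma2=2.0):
    """R_W(s, t) = sigma^2 * min(s, t); sigma2 = 2.0 is an arbitrary illustrative value."""
    return sigma2 * min(s, t)

t = 1.0
for h in (0.1, 0.01, 0.001):
    # Each value should be approximately sigma^2 * h = 2 * h.
    print(h, ms_distance(brownian_corr, t + h, t))
```

Because `ms_distance` takes the correlation function as an argument, the same check can be run against any other autocorrelation function in this section.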



Example 7.1.12 Let $N = (N_t : t \geq 0)$ be a Poisson process with rate $\lambda > 0$. Then for fixed $t$,
$E[(N_t - N_s)^2] = \lambda |t - s| + (\lambda (t - s))^2 \to 0$ as $s \to t$. Therefore $N$ is m.s. continuous. As required,
$R_N$, given by $R_N(s, t) = \lambda (s \wedge t) + \lambda^2 st$, is continuous. Since $N$ is m.s. continuous, it is also
continuous in the p. and d. senses. $N$ is also a.s. continuous at any fixed $t$, because the probability
of a jump at exactly time $t$ is zero for any fixed $t$. However, $N$ is not a.s. sample-path continuous. In
fact, $P[N \text{ is continuous on } [0, a]] = e^{-\lambda a}$ and so $P[N \text{ is continuous on } \mathbb{R}_+] = 0$.
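
The formula $P[N \text{ is continuous on } [0, a]] = e^{-\lambda a}$ in Example 7.1.12 can be checked by simulation, using the standard fact that the first jump time of a rate $\lambda$ Poisson process is exponentially distributed with parameter $\lambda$: the path is continuous on $[0, a]$ exactly when the first jump occurs after time $a$. The function name, sample size, seed, and the rate $\lambda = 2$ are illustrative assumptions.

```python
import math
import random

def prob_no_jump(rate, a, num_samples=100_000, seed=0):
    """Monte Carlo estimate of P[no jump of a rate-`rate` Poisson process in [0, a]]."""
    rng = random.Random(seed)
    count = 0
    for _ in range(num_samples):
        first_jump = rng.expovariate(rate)  # time of the first jump
        if first_jump > a:
            count += 1                      # path is continuous on [0, a]
    return count / num_samples

rate = 2.0
for a in (0.5, 1.0, 2.0):
    # Second column: simulation estimate; third column: exp(-rate * a).
    print(a, prob_no_jump(rate, a), math.exp(-rate * a))
```

As $a$ grows the probability decays to zero, consistent with the statement that $N$ is not a.s. sample-path continuous on $\mathbb{R}_+$.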

Definition 7.1.13 A random process $(X_t : t \in \mathbb{T})$, such that $\mathbb{T}$ is a bounded interval (open, closed,
or mixed) in $\mathbb{R}$ with endpoints $a < b$, is piecewise m.s. continuous, if there exist $n \geq 1$ and
$a = t_0 < t_1 < \cdots < t_n = b$, such that, for $1 \leq k \leq n$: $X$ is m.s. continuous over $(t_{k-1}, t_k)$ and has
m.s. limits at the endpoints of $(t_{k-1}, t_k)$.

More generally, if $\mathbb{T}$ is all of $\mathbb{R}$ or an interval in $\mathbb{R}$, $X$ is piecewise m.s. continuous over $\mathbb{T}$ if it is
piecewise m.s. continuous over every bounded subinterval of $\mathbb{T}$.

7.2 Mean square differentiation of random processes 

Before considering the m.s. derivative of a random process, we review the definition of the derivative
of a function (also, see Appendix 11.4). Let the index set $\mathbb{T}$ be either all of $\mathbb{R}$ or an interval in
$\mathbb{R}$. Suppose $f$ is a deterministic function on $\mathbb{T}$. Recall that for a fixed $t$ in $\mathbb{T}$, $f$ is differentiable
at $t$ if $\lim_{s \to t} \frac{f(s) - f(t)}{s - t}$ exists and is finite, and if $f$ is differentiable at $t$, the value of the limit is
the derivative, $f'(t)$. The whole function $f$ is called differentiable if it is differentiable at all $t$. The
function $f$ is called continuously differentiable if $f$ is differentiable, and the derivative function $f'$
is continuous.

In many applications of calculus, it is important that a function $f$ be not only differentiable,
but continuously differentiable. In much of the applied literature, when there is an assumption
that a function is differentiable, it is understood that the function is continuously differentiable.
For example, by the fundamental theorem of calculus,

$$f(b) - f(a) = \int_a^b f'(s)\,ds \tag{7.2}$$

holds if $f$ is a continuously differentiable function with derivative $f'$. Example 11.4.2 shows that
(7.2) might not hold if $f$ is simply assumed to be differentiable.

Let $X = (X_t : t \in \mathbb{T})$ be a second order random process such that the index set $\mathbb{T}$ is equal to
either all of $\mathbb{R}$ or an interval in $\mathbb{R}$. The following definition for m.s. derivatives is analogous to the
definition of derivatives for deterministic functions.

Definition 7.2.1 For each $t$ fixed, the random process $X = (X_t : t \in \mathbb{T})$ is mean square (m.s.)
differentiable at $t$ if the following limit exists

$$\lim_{s \to t} \frac{X_s - X_t}{s - t} \quad \text{m.s.}$$

The limit, if it exists, is the m.s. derivative of $X$ at $t$, denoted by $X'_t$. The whole random process $X$ is
said to be m.s. differentiable if it is m.s. differentiable at each $t$, and it is said to be m.s. continuously
differentiable if it is m.s. differentiable and the derivative process $X'$ is m.s. continuous.

Let $\partial_i$ denote the operation of taking the partial derivative with respect to the $i$th argument.
For example, if $f(x, y) = x^2 y^3$ then $\partial_2 f(x, y) = 3x^2 y^2$ and $\partial_1 \partial_2 f(x, y) = 6xy^2$. The partial
derivative of a function is the same as the ordinary derivative with respect to one variable, with
the other variables held fixed. We shall be applying $\partial_1$ and $\partial_2$ to an autocorrelation function
$R_X = (R_X(s, t) : (s, t) \in \mathbb{T} \times \mathbb{T})$, which is a function of two variables.
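
The partial derivative example above can be sanity-checked numerically. The sketch below approximates $\partial_2 f$ and $\partial_1 \partial_2 f$ for $f(x, y) = x^2 y^3$ with central differences; the helper names `d1` and `d2`, the step size, and the evaluation point $(1.5, 2)$ are arbitrary illustrative choices.

```python
def f(x, y):
    return x ** 2 * y ** 3

def d1(g, x, y, h=1e-4):
    """Central-difference approximation of the partial derivative in the first argument."""
    return (g(x + h, y) - g(x - h, y)) / (2 * h)

def d2(g, x, y, h=1e-4):
    """Central-difference approximation of the partial derivative in the second argument."""
    return (g(x, y + h) - g(x, y - h)) / (2 * h)

x, y = 1.5, 2.0
approx_d2 = d2(f, x, y)                              # compare with 3 x^2 y^2 = 27
approx_d1d2 = d1(lambda a, b: d2(f, a, b), x, y)     # compare with 6 x y^2 = 36
print(approx_d2, approx_d1d2)
```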

Proposition 7.2.2 (a) (The derivative of the mean is the mean of the derivative) If $X$ is m.s.
differentiable, then the mean function $\mu_X$ is differentiable, and $\mu'_X(t) = \mu_{X'}(t)$. (i.e. the
operations of (i) taking expectation, which basically involves integrating over $\omega$, and (ii)
differentiation with respect to $t$, can be done in either order.)

(b) If $X$ is m.s. differentiable, the cross correlation functions are given by $R_{X'X} = \partial_1 R_X$ and
$R_{XX'} = \partial_2 R_X$, and the autocorrelation function of $X'$ is given by $R_{X'} = \partial_1 \partial_2 R_X = \partial_2 \partial_1 R_X$.
(In particular, the indicated partial derivatives exist.)

(c) $X$ is m.s. differentiable at $t$ if and only if the following limit exists and is finite:

$$\lim_{s, s' \to t} \frac{R_X(s, s') - R_X(s, t) - R_X(t, s') + R_X(t, t)}{(s - t)(s' - t)}. \tag{7.3}$$

(Therefore, the whole process $X$ is m.s. differentiable if and only if the limit in (7.3) exists
and is finite for all $t \in \mathbb{T}$.)

(d) $X$ is m.s. continuously differentiable if and only if $R_X$, $\partial_2 R_X$, and $\partial_1 \partial_2 R_X$ exist and are
continuous. (By symmetry, if $X$ is m.s. continuously differentiable, then also $\partial_1 R_X$ is
continuous.)

(e) (Specialization of (d) for WSS case) Suppose $X$ is WSS. Then $X$ is m.s. continuously
differentiable if and only if $R_X(\tau)$, $R'_X(\tau)$, and $R''_X(\tau)$ exist and are continuous functions of $\tau$. If
$X$ is m.s. continuously differentiable then $X$ and $X'$ are jointly WSS, $X'$ has mean zero (i.e.
$\mu_{X'} = 0$) and autocorrelation function given by $R_{X'}(\tau) = -R''_X(\tau)$, and the cross correlation
functions are given by $R_{X'X}(\tau) = R'_X(\tau)$ and $R_{XX'}(\tau) = -R'_X(\tau)$.

(f) (A necessary condition for m.s. differentiability) If $X$ is WSS and m.s. differentiable, then
$R'_X(0)$ exists and $R'_X(0) = 0$.

(g) If $X$ is a m.s. differentiable Gaussian process, then $X$ and its derivative process $X'$ are jointly
Gaussian.

Proof. (a) Suppose X is m.s. differentiable. Then for any t fixed,

$$\frac{X_s - X_t}{s - t} \xrightarrow{\text{m.s.}} X_t' \quad \text{as } s \to t.$$

It thus follows that

$$\frac{\mu_X(s) - \mu_X(t)}{s - t} \to \mu_{X'}(t) \quad \text{as } s \to t, \tag{7.4}$$

because the limit of the means is the mean of the limit, for a m.s. convergent sequence (Corollary 2.2.5).
But (7.4) is just the definition of the statement that the derivative of $\mu_X$ at t is equal
to $\mu_{X'}(t)$. That is, $\frac{d\mu_X}{dt}(t) = \mu_{X'}(t)$ for all t, or more concisely, $\mu_X' = \mu_{X'}$.

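(b) Suppose X is m.s. differentiable. Since the limit of the correlations is the correlation of the
limits for m.s. convergent sequences (Corollary 2.2.4), for $t, t' \in \mathbb{T}$,

$$R_{X'X}(t,t') = \lim_{s \to t} E\left[ \left( \frac{X(s) - X(t)}{s - t} \right) X(t') \right] = \lim_{s \to t} \frac{R_X(s,t') - R_X(t,t')}{s - t} = \partial_1 R_X(t,t').$$

Thus, $R_{X'X} = \partial_1 R_X$, and in particular, the partial derivative $\partial_1 R_X$ exists. Similarly, $R_{XX'} = \partial_2 R_X$.
Also, by the same reasoning,

$$R_{X'}(t,t') = \lim_{s' \to t'} E\left[ X'(t) \left( \frac{X(s') - X(t')}{s' - t'} \right) \right] = \lim_{s' \to t'} \frac{R_{X'X}(t,s') - R_{X'X}(t,t')}{s' - t'} = \partial_2 R_{X'X}(t,t') = \partial_2 \partial_1 R_X(t,t'),$$

so that $R_{X'} = \partial_2 \partial_1 R_X$. Similarly, $R_{X'} = \partial_1 \partial_2 R_X$.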

(c) By the correlation form of the Cauchy criterion (Proposition 2.2.3), X is m.s. differentiable
at t if and only if the following limit exists and is finite:

$$\lim_{s,s' \to t} E\left[ \left( \frac{X(s) - X(t)}{s - t} \right) \left( \frac{X(s') - X(t)}{s' - t} \right) \right]. \tag{7.5}$$

Multiplying out the terms in the numerator of (7.5) and using $E[X(s)X(s')] = R_X(s,s')$,
$E[X(s)X(t)] = R_X(s,t)$, and so on, shows that (7.5) is equivalent to (7.3). So part (c)
is proved.
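
For reference, the expansion behind the equivalence of the two limits in part (c) can be written out term by term. This is only the algebra described in words above, with no new assumptions:

```latex
\begin{align*}
E\big[(X(s)-X(t))\,(X(s')-X(t))\big]
  &= E[X(s)X(s')] - E[X(s)X(t)] - E[X(t)X(s')] + E[X(t)X(t)] \\
  &= R_X(s,s') - R_X(s,t) - R_X(t,s') + R_X(t,t).
\end{align*}
```

Dividing both sides by $(s-t)(s'-t)$ turns the expectation of the product of difference quotients into the ratio of autocorrelation values.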

(d) The numerator in (7.3) involves $R_X$ evaluated at the four corners of the rectangle $[t,s] \times [t,s']$,
shown in Figure 7.2. Suppose $R_X$, $\partial_2 R_X$ and $\partial_1 \partial_2 R_X$ exist and are continuous functions.

Figure 7.2: Sampling points of $R_X$.

Then by the fundamental theorem of calculus,

$$\begin{aligned}
(R_X(s,s') - R_X(s,t)) - (R_X(t,s') - R_X(t,t))
 &= \int_t^{s'} \partial_2 R_X(s,v)\,dv - \int_t^{s'} \partial_2 R_X(t,v)\,dv \\
 &= \int_t^{s'} \left[ \partial_2 R_X(s,v) - \partial_2 R_X(t,v) \right] dv \\
 &= \int_t^{s'} \int_t^{s} \partial_1 \partial_2 R_X(u,v)\,du\,dv.
\end{aligned} \tag{7.6}$$

Therefore, the ratio in (7.3) is the average value of $\partial_1 \partial_2 R_X$ over the rectangle $[t,s] \times [t,s']$. Since
$\partial_1 \partial_2 R_X$ is assumed to be continuous, the limit in (7.3) exists and it is equal to $\partial_1 \partial_2 R_X(t,t)$. Therefore,
by part (c) already proved, X is m.s. differentiable. By part (b), the autocorrelation function
of X' is $\partial_1 \partial_2 R_X$. Since this is assumed to be continuous, it follows that X' is m.s. continuous. Thus,
X is m.s. continuously differentiable.

(e) If X is WSS, then $R_X(s,t) = R_X(\tau)$ where $\tau = s - t$. Suppose $R_X(\tau)$, $R_X'(\tau)$ and $R_X''(\tau)$
exist and are continuous functions of $\tau$. Then

$$\partial_1 R_X(s,t) = R_X'(\tau) \quad \text{and} \quad \partial_2 \partial_1 R_X(s,t) = -R_X''(\tau). \tag{7.7}$$

The minus sign in (7.7) appears because $R_X(s,t) = R_X(\tau)$ where $\tau = s - t$, and the derivative of
$\tau$ with respect to t is $-1$. So, the hypotheses of part (d) hold, so that X is m.s. differentiable. Since
X is WSS, its mean function $\mu_X$ is constant, which has derivative zero, so X' has mean zero. Also
by part (b) and (7.7), $R_{X'X}(\tau) = R_X'(\tau)$ and $R_{X'} = -R_X''$. Similarly, $R_{XX'}(\tau) = -R_X'(\tau)$. Note
that X and X' are each WSS and the cross correlation functions depend on $\tau$ alone, so X and X'
are jointly WSS.

(f) If X is WSS then

$$E\left[ \left( \frac{X(\tau) - X(0)}{\tau} \right)^2 \right] = -\frac{2 (R_X(\tau) - R_X(0))}{\tau^2}. \tag{7.8}$$

Therefore, if X is m.s. differentiable then the right side of (7.8) must converge to a finite limit as
$\tau \to 0$, so in particular it is necessary that $(R_X(\tau) - R_X(0))/\tau \to 0$ as $\tau \to 0$. Therefore $R_X'(0) = 0$.

(g) The derivative process X' is obtained by taking linear combinations and m.s. limits of
random variables in $X = (X_t : t \in \mathbb{T})$. Therefore, (g) follows from the fact that the joint Gaussian
property is preserved under linear combinations and limits (Proposition 3.4.3(c)). □



Example 7.2.3 Let $f(t) = t^2 \sin(1/t^2)$ for $t \neq 0$ and $f(0) = 0$ as in Example 11.4.2, and let
$X = (X_t : t \in \mathbb{R})$ be the deterministic random process such that $X(t) = f(t)$ for all $t \in \mathbb{R}$. Since X
is differentiable as an ordinary function, it is also m.s. differentiable, and its m.s. derivative X' is
equal to $f'$. Since X', as a deterministic function, is not continuous at zero, it is also not continuous
at zero in the m.s. sense. We have $R_X(s,t) = f(s)f(t)$ and $\partial_2 R_X(s,t) = f(s)f'(t)$, which is not
continuous. So indeed the conditions of Proposition 7.2.2(d) do not hold, as required.



Example 7.2.4 A Brownian motion $W = (W_t : t \geq 0)$ is not m.s. differentiable. If it were, then
for any fixed $t \geq 0$, $\frac{W(s) - W(t)}{s - t}$ would converge in the m.s. sense as $s \to t$ to a random variable
with a finite second moment. For a m.s. convergent sequence, the second moments of the variables
in the sequence converge to the second moment of the limit random variable, which is finite. But
$W(s) - W(t)$ has mean zero and variance $\sigma^2 |s - t|$, so that

$$\lim_{s \to t} E\left[ \left( \frac{W(s) - W(t)}{s - t} \right)^2 \right] = \lim_{s \to t} \frac{\sigma^2}{|s - t|} = +\infty. \tag{7.9}$$

Thus, W is not m.s. differentiable at any t. For another approach, we could appeal to Proposition
7.2.2 to deduce this result. The limit in (7.9) is the same as the limit in (7.5), but with s and s'
restricted to be equal. Hence (7.5), or equivalently (7.3), is not a finite limit, implying that W is
not differentiable at t.
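
The blow-up of the second moment of the difference quotient can also be seen empirically. The sketch below is a Monte Carlo estimate that assumes only that the increment $W(t+h) - W(t)$ is $N(0, \sigma^2 h)$; the function name, sample size and seed are choices made for this illustration.

```python
import math
import random


def mean_square_difference_quotient(h, sigma2=1.0, n_samples=20000, seed=0):
    """Monte Carlo estimate of E[((W(t+h) - W(t)) / h)^2].

    Relies only on W(t+h) - W(t) being Gaussian with mean 0 and variance sigma2 * h.
    """
    rng = random.Random(seed)
    std = math.sqrt(sigma2 * h)
    total = 0.0
    for _ in range(n_samples):
        increment = rng.gauss(0.0, std)
        total += (increment / h) ** 2
    return total / n_samples


# The estimates should track sigma2 / h, which grows without bound as h shrinks.
for h in (1.0, 0.1, 0.01, 0.001):
    print(h, mean_square_difference_quotient(h), 1.0 / h)
```

The printed estimates grow roughly like $1/h$, in line with the divergence shown in the example.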

Similarly, a Poisson process is not m.s. differentiable at any t. A WSS process X with $R_X(\tau) = e^{-\alpha|\tau|}$
is not m.s. differentiable because $R_X'(0)$ does not exist. A WSS process X with $R_X(\tau) = \frac{1}{1+\tau^2}$
is m.s. differentiable, and its derivative process X' is WSS with mean 0 and covariance function

$$R_{X'}(\tau) = -\left( \frac{1}{1+\tau^2} \right)'' = \frac{2 - 6\tau^2}{(1+\tau^2)^3}.$$
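
For the WSS example with $R_X(\tau) = 1/(1+\tau^2)$, the relation $R_{X'} = -R_X''$ can be checked numerically. The following sketch compares a second-order central difference of $R_X$ with the closed form $(2 - 6\tau^2)/(1+\tau^2)^3$; the function names and the step size are assumptions for this illustration.

```python
def R_X(tau):
    """Autocorrelation function R_X(tau) = 1 / (1 + tau^2)."""
    return 1.0 / (1.0 + tau * tau)


def R_Xprime_closed_form(tau):
    """Closed form (2 - 6 tau^2) / (1 + tau^2)^3 for the derivative process."""
    return (2.0 - 6.0 * tau * tau) / (1.0 + tau * tau) ** 3


def minus_second_derivative(R, tau, eps=1e-4):
    """Central-difference estimate of -R''(tau)."""
    return -(R(tau + eps) - 2.0 * R(tau) + R(tau - eps)) / (eps * eps)


for tau in (0.0, 0.5, 1.0, 2.0):
    print(tau, minus_second_derivative(R_X, tau), R_Xprime_closed_form(tau))
```

At $\tau = 0$ both columns are close to 2, which is the second moment of $X'_t$.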



Proposition 7.2.5 Suppose X is a m.s. differentiable random process and f is a differentiable
function. Then the product $Xf = (X(t)f(t) : t \in \mathbb{R})$ is mean square differentiable and
$(Xf)' = X'f + Xf'$.

Proof: Fix t. Then for each $s \neq t$,

$$\frac{X(s)f(s) - X(t)f(t)}{s - t} = \frac{(X(s) - X(t)) f(s)}{s - t} + \frac{X(t)(f(s) - f(t))}{s - t} \xrightarrow{\text{m.s.}} X'(t)f(t) + X(t)f'(t) \quad \text{as } s \to t.$$

Definition 7.2.6 A random process X on a bounded interval (open, closed, or mixed) with endpoints
$a < b$ is continuous and piecewise continuously differentiable in the m.s. sense, if X is m.s.
continuous over the interval, and if there exists $n \geq 1$ and $a = t_0 < t_1 < \cdots < t_n = b$, such that,
for $1 \leq k \leq n$: X is m.s. continuously differentiable over $(t_{k-1}, t_k)$ and X' has finite limits at the
endpoints of $(t_{k-1}, t_k)$.

More generally, if $\mathbb{T}$ is all of $\mathbb{R}$ or a subinterval of $\mathbb{R}$, then a random process $X = (X_t : t \in \mathbb{T})$
is continuous and piecewise continuously differentiable in the m.s. sense if its restriction to any
bounded interval is continuous and piecewise continuously differentiable in the m.s. sense.

7.3 Integration of random processes 

Let $X = (X_t : a \leq t \leq b)$ be a random process and let h be a function on a finite interval $[a,b]$.
How shall we define the following integral?

$$\int_a^b X_t h(t)\,dt. \tag{7.10}$$

One approach is to note that for each fixed $\omega$, $X_t(\omega)$ is a deterministic function of time, and so
the integral can be defined as the integral of a deterministic function for each $\omega$. We shall focus
on another approach, namely mean square (m.s.) integration. An advantage of m.s. integration is
that it relies much less on properties of sample paths of random processes.

As for integration of deterministic functions, the m.s. Riemann integrals are based on Riemann
sums, defined as follows. Given:

• A partition of $(a,b]$ of the form $(t_0,t_1], (t_1,t_2], \ldots, (t_{n-1},t_n]$, where $n \geq 0$ and
$a = t_0 < t_1 < \cdots < t_n = b$

• A sampling point from each subinterval, $v_k \in (t_{k-1}, t_k]$, for $1 \leq k \leq n$,

the corresponding Riemann sum for $Xh$ is defined by

$$\sum_{k=1}^{n} X_{v_k} h(v_k)(t_k - t_{k-1}).$$

The norm of the partition is defined to be $\max_k |t_k - t_{k-1}|$.
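
The Riemann sum and the norm of a partition translate directly into code. The sketch below evaluates the sum for a single sample path supplied as an ordinary function; the names `riemann_sum` and `partition_norm`, and the use of a deterministic path $x(t) = t$ in the demonstration, are assumptions for this illustration.

```python
def riemann_sum(x, h, endpoints, sample_points):
    """Riemann sum  sum_k x(v_k) h(v_k) (t_k - t_{k-1})  for one sample path x.

    endpoints     : [t_0, t_1, ..., t_n] with t_0 < t_1 < ... < t_n
    sample_points : [v_1, ..., v_n] with v_k in (t_{k-1}, t_k]
    """
    if len(sample_points) != len(endpoints) - 1:
        raise ValueError("need exactly one sampling point per subinterval")
    total = 0.0
    for k in range(1, len(endpoints)):
        t_prev, t_k, v_k = endpoints[k - 1], endpoints[k], sample_points[k - 1]
        if not (t_prev < v_k <= t_k):
            raise ValueError("sampling point outside its subinterval")
        total += x(v_k) * h(v_k) * (t_k - t_prev)
    return total


def partition_norm(endpoints):
    """Largest subinterval length of the partition."""
    return max(endpoints[k] - endpoints[k - 1] for k in range(1, len(endpoints)))


n = 1000
endpoints = [k / n for k in range(n + 1)]
sample_points = endpoints[1:]          # right endpoints as sampling points
print(partition_norm(endpoints))
print(riemann_sum(lambda t: t, lambda t: 1.0, endpoints, sample_points))
```

For a random process, the same function would be applied to each realization of the path, and m.s. convergence concerns the behavior of these sums as random variables when the norm of the partition shrinks.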

Definition 7.3.1 The Riemann integral $\int_a^b X_t h(t)\,dt$ is said to exist in the m.s. sense and its
value is the random variable $I$ if the following is true. Given any $\epsilon > 0$, there is a $\delta > 0$ so that
$E[(\sum_{k=1}^{n} X_{v_k} h(v_k)(t_k - t_{k-1}) - I)^2] \leq \epsilon$ whenever the norm of the partition is less than or equal to
$\delta$. This definition is equivalent to the following condition, expressed using convergence of sequences.
The m.s. Riemann integral exists and is equal to $I$, if for any sequence of partitions, specified by
$((t_1^m, t_2^m, \ldots, t_{n_m}^m) : m \geq 1)$, with corresponding sampling points $((v_1^m, \ldots, v_{n_m}^m) : m \geq 1)$, such that
the norm of the $m$th partition converges to zero as $m \to \infty$, the corresponding sequence of Riemann
sums converges in the m.s. sense to $I$ as $m \to \infty$. The process $X_t h(t)$ is said to be m.s. Riemann
integrable over $(a,b]$ if the integral $\int_a^b X_t h(t)\,dt$ exists and is finite.

Next, suppose $X_t h(t)$ is defined over the whole real line. If $X_t h(t)$ is m.s. Riemann integrable over
every bounded interval $[a,b]$, then the Riemann integral of $X_t h(t)$ over $\mathbb{R}$ is defined by

$$\int_{-\infty}^{\infty} X_t h(t)\,dt = \lim_{a,b \to \infty} \int_{-a}^{b} X_t h(t)\,dt \quad \text{m.s.}$$

provided that the indicated limit exists as a, b jointly converge to $+\infty$.

Whether an integral exists in the m.s. sense is determined by the autocorrelation function of
the random process involved, as shown next. The condition involves Riemann integration of a
deterministic function of two variables. As reviewed in Appendix 11.5, a two-dimensional Riemann
integral over a bounded rectangle is defined as the limit of Riemann sums corresponding to a
partition of the rectangle into subrectangles and choices of sampling points within the subrectangles.
If the sampling points for the Riemann sums are required to be horizontally and vertically aligned,
then we say the two-dimensional Riemann integral exists with aligned sampling.

Proposition 7.3.2 The integral $\int_a^b X_t h(t)\,dt$ exists in the m.s. Riemann sense if and only if

$$\int_a^b \int_a^b R_X(s,t) h(s) h(t)\,ds\,dt \tag{7.11}$$

exists as a two-dimensional Riemann integral with aligned sampling. The m.s. integral exists, in
particular, if X is m.s. piecewise continuous over $[a,b]$ and h is piecewise continuous over $[a,b]$.




Proof. By definition, the m.s. integral of $X_t h(t)$ exists if and only if the Riemann sums converge
in the m.s. sense for an arbitrary sequence of partitions and sampling points, such that the norms
of the partitions converge to zero. So consider an arbitrary sequence of partitions of $(a,b]$ into
intervals specified by the collection of endpoints, $((t_0^m, t_1^m, \ldots, t_{n_m}^m) : m \geq 1)$, with corresponding
sampling point $v_k^m \in (t_{k-1}^m, t_k^m]$ for each m and $1 \leq k \leq n_m$, such that the norm of the $m$th partition
converges to zero as $m \to \infty$. For each $m \geq 1$, let $S_m$ denote the corresponding Riemann sum:

$$S_m = \sum_{k=1}^{n_m} X_{v_k^m} h(v_k^m)(t_k^m - t_{k-1}^m).$$

By the correlation form of the Cauchy criterion for m.s. convergence (Proposition 2.2.3), $(S_m : m \geq 1)$
converges in the m.s. sense if and only if $\lim_{m,m' \to \infty} E[S_m S_{m'}]$ exists and is finite. Now

$$E[S_m S_{m'}] = \sum_{j=1}^{n_m} \sum_{k=1}^{n_{m'}} R_X(v_j^m, v_k^{m'}) h(v_j^m) h(v_k^{m'}) (t_j^m - t_{j-1}^m)(t_k^{m'} - t_{k-1}^{m'}), \tag{7.12}$$

and the right-hand side of (7.12) is the Riemann sum for the integral (7.11), for the partition of
$(a,b] \times (a,b]$ into rectangles of the form $(t_{j-1}^m, t_j^m] \times (t_{k-1}^{m'}, t_k^{m'}]$ and the sampling points $(v_j^m, v_k^{m'})$.
Note that the $n_m n_{m'}$ sampling points are aligned, in that they are determined by the $n_m + n_{m'}$ numbers
$v_1^m, \ldots, v_{n_m}^m, v_1^{m'}, \ldots, v_{n_{m'}}^{m'}$. Moreover, any Riemann sum for the integral (7.11) with aligned
sampling can arise in this way. Further, as $m, m' \to \infty$, the norm of this partition, which is the
maximum length or width of any rectangle of the partition, converges to zero. Thus, the limit
$\lim_{m,m' \to \infty} E[S_m S_{m'}]$ exists for any sequence of partitions and sampling points if and only if the
integral (7.11) exists as a two-dimensional Riemann integral with aligned sampling.

Finally, if X is piecewise m.s. continuous over $[a,b]$ and h is piecewise continuous over $[a,b]$, then
there is a partition of $[a,b]$ into intervals of the form $(s_{k-1}, s_k]$ such that X is m.s. continuous over
$(s_{k-1}, s_k)$ with m.s. limits at the endpoints, and h is continuous over $(s_{k-1}, s_k)$ with finite limits
at the endpoints. Therefore, $R_X(s,t)h(s)h(t)$ restricted to each rectangle of the form $(s_{j-1}, s_j) \times (s_{k-1}, s_k)$
is the restriction of a continuous function on $[s_{j-1}, s_j] \times [s_{k-1}, s_k]$. Thus $R_X(s,t)h(s)h(t)$
is Riemann integrable over $[a,b] \times [a,b]$. □

Proposition 7.3.3 Suppose $X_t h(t)$ and $Y_t k(t)$ are both m.s. integrable over $[a,b]$. Then

$$E\left[ \int_a^b X_t h(t)\,dt \right] = \int_a^b \mu_X(t) h(t)\,dt \tag{7.13}$$

$$E\left[ \left( \int_a^b X_t h(t)\,dt \right)^2 \right] = \int_a^b \int_a^b R_X(s,t) h(s) h(t)\,ds\,dt \tag{7.14}$$

$$\mathrm{Var}\left( \int_a^b X_t h(t)\,dt \right) = \int_a^b \int_a^b C_X(s,t) h(s) h(t)\,ds\,dt \tag{7.15}$$

$$E\left[ \int_a^b X_s h(s)\,ds \int_a^b Y_t k(t)\,dt \right] = \int_a^b \int_a^b R_{XY}(s,t) h(s) k(t)\,ds\,dt \tag{7.16}$$

$$\mathrm{Cov}\left( \int_a^b X_s h(s)\,ds, \int_a^b Y_t k(t)\,dt \right) = \int_a^b \int_a^b C_{XY}(s,t) h(s) k(t)\,ds\,dt \tag{7.17}$$

$$\int_a^b \left( X_t h(t) + Y_t k(t) \right) dt = \int_a^b X_t h(t)\,dt + \int_a^b Y_t k(t)\,dt. \tag{7.18}$$


Proof. Let $(S_m)$ denote the sequence of Riemann sums appearing in the proof of Proposition
7.3.2. Since the mean of a m.s. convergent sequence of random variables is the limit of the means
(Corollary 2.2.5),

$$E\left[ \int_a^b X_t h(t)\,dt \right] = \lim_{m \to \infty} E[S_m] = \lim_{m \to \infty} \sum_{k=1}^{n_m} \mu_X(v_k^m) h(v_k^m)(t_k^m - t_{k-1}^m). \tag{7.19}$$

The right-hand side of (7.19) is a limit of Riemann sums for the integral $\int_a^b \mu_X(t) h(t)\,dt$. Since this
limit exists and is equal to $E[\int_a^b X_t h(t)\,dt]$ for any sequence of partitions and sample points, it
follows that $\int_a^b \mu_X(t) h(t)\,dt$ exists as a Riemann integral, and is equal to $E[\int_a^b X_t h(t)\,dt]$, so (7.13)
is proved.

The second moment of the m.s. limit of $(S_m : m \geq 1)$ is equal to $\lim_{m,m' \to \infty} E[S_m S_{m'}]$, by the
correlation form of the Cauchy criterion for m.s. convergence (Proposition 2.2.3), which implies
(7.14). It follows from (7.13) that

$$\left( E\left[ \int_a^b X_t h(t)\,dt \right] \right)^2 = \int_a^b \int_a^b \mu_X(s) \mu_X(t) h(s) h(t)\,ds\,dt.$$

Subtracting each side of this from the corresponding side of (7.14) yields (7.15). The proofs of
(7.16) and (7.17) are similar to the proofs of (7.14) and (7.15), and are left to the reader.

For any partition of $[a,b]$ and choice of sampling points, the Riemann sums for the three integrals
appearing in (7.18) satisfy the corresponding additivity condition, implying (7.18). □

The fundamental theorem of calculus, stated in Appendix 11.5, states that the increments of a
continuous, piecewise continuously differentiable function are equal to integrals of the derivative of
the function. The following is the generalization of the fundamental theorem of calculus to the m.s.
calculus.

Theorem 7.3.4 (Fundamental Theorem of m.s. Calculus) Let X be a m.s. continuously differentiable
random process. Then for $a < b$,

$$X_b - X_a = \int_a^b X_t'\,dt \quad \text{(m.s. Riemann integral)} \tag{7.20}$$

More generally, if X is continuous and piecewise continuously differentiable, (7.20) holds with $X_t'$
replaced by the right-hand derivative, $D_+ X_t$. (Note that $D_+ X_t = X_t'$ whenever $X_t'$ is defined.)

Proof. The m.s. Riemann integral in (7.20) exists because X' is assumed to be m.s. continuous.
Let $B = X_b - X_a - \int_a^b X_t'\,dt$, and let Y be an arbitrary random variable with a finite second moment.
It suffices to show that $E[YB] = 0$, because a possible choice of Y is B itself. Let $\phi(t) = E[Y X_t]$.
Then for $s \neq t$,

$$\frac{\phi(s) - \phi(t)}{s - t} = E\left[ Y \left( \frac{X_s - X_t}{s - t} \right) \right].$$

Taking a limit as $s \to t$ and using the fact that the correlation of a limit is the limit of the correlations
for m.s. convergent sequences, it follows that $\phi$ is differentiable and $\phi'(t) = E[Y X_t']$. Since X' is
m.s. continuous, it similarly follows that $\phi'$ is continuous.

Next, we use the fact that the integral in (7.20) is the m.s. limit of Riemann sums, with each
Riemann sum corresponding to a partition of $(a,b]$ specified by some $n \geq 1$ and $a = t_0 < \cdots < t_n = b$
and sampling points $v_k \in (t_{k-1}, t_k]$ for $1 \leq k \leq n$. Since the limit of the correlation is the correlation
of the limit for m.s. convergence,

$$E\left[ Y \int_a^b X_t'\,dt \right] = \lim_{\max_k |t_k - t_{k-1}| \to 0} E\left[ Y \sum_{k=1}^{n} X_{v_k}'(t_k - t_{k-1}) \right] = \lim_{\max_k |t_k - t_{k-1}| \to 0} \sum_{k=1}^{n} \phi'(v_k)(t_k - t_{k-1}) = \int_a^b \phi'(t)\,dt.$$

Therefore, $E[YB] = \phi(b) - \phi(a) - \int_a^b \phi'(t)\,dt$, which is equal to zero by the fundamental theorem
of calculus for deterministic continuously differentiable functions. This establishes (7.20) in case
X is m.s. continuously differentiable. If X is m.s. continuous and only piecewise continuously
differentiable, we can use essentially the same proof, observing that $\phi$ is continuous and piecewise
continuously differentiable, so that $E[YB] = \phi(b) - \phi(a) - \int_a^b \phi'(t)\,dt = 0$ by the fundamental
theorem of calculus for deterministic continuous, piecewise continuously differentiable functions. □



Proposition 7.3.5 Suppose X is a Gaussian random process. Then X, together with all mean
square derivatives of X that exist, and all m.s. Riemann integrals of X of the form
$I(a,b) = \int_a^b X_t h(t)\,dt$ that exist, are jointly Gaussian.

Proof. The m.s. derivatives and integrals of X are obtained by taking m.s. limits of linear
combinations of $X = (X_t : t \in \mathbb{T})$. Therefore, the proposition follows from the fact that the joint
Gaussian property is preserved under linear combinations and limits (Proposition 3.4.3(c)). □



Theoretical Exercise Suppose $X = (X_t : t \geq 0)$ is a random process such that $R_X$ is continuous.
Let $Y_t = \int_0^t X_s\,ds$. Show that Y is m.s. differentiable, and $P[Y_t' = X_t] = 1$ for $t \geq 0$.

Example 7.3.6 Let $(W_t : t \geq 0)$ be a Brownian motion with $\sigma^2 = 1$, and let $X_t = \int_0^t W_s\,ds$ for
$t \geq 0$. Let us find $R_X$ and $P[|X_t| \geq t]$ for $t > 0$. Since $R_W(u,v) = u \wedge v$,

$$R_X(s,t) = E\left[ \int_0^s W_u\,du \int_0^t W_v\,dv \right] = \int_0^s \int_0^t (u \wedge v)\,dv\,du.$$

To proceed, consider first the case $s \geq t$ and partition the region of integration into three parts as
shown in Figure 7.3. The contributions from the two triangular subregions are the same, so

Figure 7.3: Partition of region of integration.

$$R_X(s,t) = 2 \int_0^t \int_0^u v\,dv\,du + \int_t^s \int_0^t v\,dv\,du = \frac{t^3}{3} + \frac{t^2 (s - t)}{2} = \frac{t^2 s}{2} - \frac{t^3}{6}.$$

Still assuming that $s \geq t$, this expression can be rewritten as

$$R_X(s,t) = \frac{st(s \wedge t)}{2} - \frac{(s \wedge t)^3}{6}. \tag{7.21}$$

Although we have found (7.21) only for $s \geq t$, both sides are symmetric in s and t. Thus (7.21)
holds for all s, t.

Since W is a Gaussian process, X is a Gaussian process. Also, $E[X_t] = 0$ (because W is mean
zero) and $E[X_t^2] = R_X(t,t) = \frac{t^3}{3}$. Thus,

$$P[|X_t| \geq t] = 2 P\left[ \frac{X_t}{\sqrt{t^3/3}} \geq \frac{t}{\sqrt{t^3/3}} \right] = 2Q\left( \sqrt{\frac{3}{t}} \right).$$

Note that $P[|X_t| \geq t] \to 1$ as $t \to +\infty$.
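
The closed form for the autocorrelation of integrated Brownian motion, $R_X(s,t) = st(s \wedge t)/2 - (s \wedge t)^3/6$, and the tail probability $2Q(\sqrt{3/t})$ can be sanity-checked numerically. The sketch below uses a midpoint rule for the double integral of $u \wedge v$ and expresses $Q$ through the standard library's `math.erfc`; the function names and grid size are assumptions for this illustration.

```python
import math


def R_X(s, t):
    """Closed form s t (s ^ t)/2 - (s ^ t)^3/6, where ^ denotes the minimum."""
    m = min(s, t)
    return s * t * m / 2.0 - m ** 3 / 6.0


def R_X_numeric(s, t, n=400):
    """Midpoint-rule approximation of the double integral of min(u, v) over [0,s] x [0,t]."""
    du, dv = s / n, t / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * du
        for j in range(n):
            v = (j + 0.5) * dv
            total += min(u, v)
    return total * du * dv


def Q(x):
    """Standard normal complementary CDF, Q(x) = P[N(0,1) >= x]."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def prob_abs_X_at_least_t(t):
    """P[|X_t| >= t] = 2 Q(sqrt(3/t)) for t > 0."""
    return 2.0 * Q(math.sqrt(3.0 / t))


print(R_X(2.0, 1.0), R_X_numeric(2.0, 1.0))
for t in (1.0, 3.0, 100.0, 1e6):
    print(t, prob_abs_X_at_least_t(t))
```

The probabilities increase toward 1 as t grows, matching the closing remark of the example.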



Example 7.3.7 Let $N = (N_t : t \geq 0)$ be a second order process with a continuous autocorrelation
function $R_N$ and let $x_0$ be a constant. Consider the problem of finding a m.s. differentiable random
process $X = (X_t : t \geq 0)$ satisfying the linear differential equation

$$X_t' = -X_t + N_t, \quad X_0 = x_0. \tag{7.22}$$

Guided by the case that $N_t$ is a smooth nonrandom function, we write

$$X_t = x_0 e^{-t} + \int_0^t e^{-(t-v)} N_v\,dv \tag{7.23}$$

or

$$X_t = x_0 e^{-t} + e^{-t} \int_0^t e^{v} N_v\,dv. \tag{7.24}$$

Using Proposition 7.2.5, it is not difficult to check that (7.24) indeed gives the solution to (7.22).

Next, let us find the mean and autocovariance functions of X in terms of those of N. Taking
the expectation on each side of (7.23) yields

$$\mu_X(t) = x_0 e^{-t} + \int_0^t e^{-(t-v)} \mu_N(v)\,dv. \tag{7.25}$$

A different way to derive (7.25) is to take expectations in (7.22) to yield the deterministic linear
differential equation:

$$\mu_X'(t) = -\mu_X(t) + \mu_N(t); \quad \mu_X(0) = x_0$$

which can be solved to yield (7.25). To summarize, we found two methods to start with the
stochastic differential equation (7.22) to derive (7.25), thereby expressing the mean function of the
solution X in terms of the mean function of the driving process N. The first is to solve (7.22) to
obtain (7.23) and then take expectations, the second is to take expectations first and then solve
the deterministic differential equation for $\mu_X$.

The same two methods can be used to express the covariance function of X in terms of the
covariance function of N. For the first method, we use (7.23) to obtain

$$C_X(s,t) = \mathrm{Cov}\left( x_0 e^{-s} + \int_0^s e^{-(s-u)} N_u\,du,\; x_0 e^{-t} + \int_0^t e^{-(t-v)} N_v\,dv \right) = \int_0^s \int_0^t e^{-(s-u)} e^{-(t-v)} C_N(u,v)\,dv\,du. \tag{7.26}$$

The second method is to derive deterministic differential equations. To begin, note that

$$\partial_1 C_X(s,t) = \mathrm{Cov}(X_s', X_t) = \mathrm{Cov}(-X_s + N_s, X_t)$$

so

$$\partial_1 C_X(s,t) = -C_X(s,t) + C_{NX}(s,t). \tag{7.27}$$

For t fixed, this is a differential equation in s. Also, $C_X(0,t) = 0$. If somehow the cross covariance
function $C_{NX}$ is found, (7.27) and the boundary condition $C_X(0,t) = 0$ can be used to find $C_X$.
So we turn next to finding a differential equation for $C_{NX}$.

$$\partial_2 C_{NX}(s,t) = \mathrm{Cov}(N_s, X_t') = \mathrm{Cov}(N_s, -X_t + N_t)$$

so

$$\partial_2 C_{NX}(s,t) = -C_{NX}(s,t) + C_N(s,t). \tag{7.28}$$

For s fixed, this is a differential equation in t with initial condition $C_{NX}(s,0) = 0$. Solving (7.28)
yields

$$C_{NX}(s,t) = \int_0^t e^{-(t-v)} C_N(s,v)\,dv. \tag{7.29}$$

Using (7.29) to replace $C_{NX}$ in (7.27) and solving (7.27) yields (7.26).
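
The two routes to the mean function in Example 7.3.7 can be compared numerically. The sketch below evaluates the integral formula $\mu_X(t) = x_0 e^{-t} + \int_0^t e^{-(t-v)} \mu_N(v)\,dv$ with a midpoint rule and, separately, integrates the mean equation $\mu_X' = -\mu_X + \mu_N$, $\mu_X(0) = x_0$, by forward Euler. The function names, the step counts, and the constant choice $\mu_N \equiv 1$ in the demonstration are assumptions for this illustration.

```python
import math


def mean_by_formula(t, x0, mu_N, n=2000):
    """Evaluate x0 e^{-t} + int_0^t e^{-(t-v)} mu_N(v) dv with a midpoint rule."""
    dv = t / n
    integral = 0.0
    for k in range(n):
        v = (k + 0.5) * dv
        integral += math.exp(-(t - v)) * mu_N(v)
    return x0 * math.exp(-t) + integral * dv


def mean_by_ode(t, x0, mu_N, n=20000):
    """Forward Euler for mu' = -mu + mu_N with mu(0) = x0."""
    dt = t / n
    mu = x0
    for k in range(n):
        mu += dt * (-mu + mu_N(k * dt))
    return mu


def mu_N(v):
    """Illustrative mean of the driving process: constant 1."""
    return 1.0


t, x0 = 1.5, 2.0
exact = x0 * math.exp(-t) + 1.0 - math.exp(-t)   # closed form when mu_N is constant 1
print(mean_by_formula(t, x0, mu_N), mean_by_ode(t, x0, mu_N), exact)
```

Both numbers agree with the closed form up to discretization error, which is the numerical counterpart of the statement that solving and then averaging gives the same mean as averaging and then solving.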



7.4 Ergodicity 

Let X be a stationary or WSS random process. Ergodicity generally means that certain time
averages are asymptotically equal to certain statistical averages. For example, suppose
$X = (X_t : t \in \mathbb{R})$ is WSS and m.s. continuous. The mean $\mu_X$ is defined as a statistical average:
$\mu_X = E[X_t]$ for any $t \in \mathbb{R}$.

The time average of X over the interval $[0,t]$ is given by

$$\frac{1}{t} \int_0^t X_u\,du.$$

Of course, for t fixed, the time average is a random variable, and is typically not equal to the
statistical average $\mu_X$. The random process X is called mean ergodic (in the m.s. sense) if

$$\lim_{t \to \infty} \frac{1}{t} \int_0^t X_u\,du = \mu_X \quad \text{m.s.}$$

A discrete time WSS random process X is similarly called mean ergodic (in the m.s. sense) if

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} X_i = \mu_X \quad \text{m.s.} \tag{7.30}$$

For example, by the m.s. version of the law of large numbers, if $X = (X_n : n \in \mathbb{Z})$ is WSS with
$C_X(n) = I_{\{n=0\}}$ (so that the $X_i$'s are uncorrelated) then (7.30) is true. For another example, if
$C_X(n) = 1$ for all n, it means that $X_0$ has variance one and $P\{X_k = X_0\} = 1$ for all k (because
equality holds in the Schwarz inequality: $C_X(n) \leq C_X(0)$). Then for all $n \geq 1$,

$$\frac{1}{n} \sum_{k=1}^{n} X_k = X_0.$$

Since $X_0$ has variance one, the process X is not mean ergodic if $C_X(n) = 1$ for all n. In general, whether
X is mean ergodic in the m.s. sense is determined by the autocovariance function, $C_X$. The result
is stated and proved next for continuous time, and the discrete-time version is true as well.

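Proposition 7.4.1 Let X be a real-valued, WSS, m.s. continuous random process. Then X is
mean ergodic (in the m.s. sense) if and only if

$$\lim_{t \to \infty} \frac{2}{t} \int_0^t \left( \frac{t - \tau}{t} \right) C_X(\tau)\,d\tau = 0. \tag{7.31}$$

Sufficient conditions are

(a) $\lim_{\tau \to \infty} C_X(\tau) = 0$. (This condition is also necessary if $\lim_{\tau \to \infty} C_X(\tau)$ exists.)

(b) $\int_{-\infty}^{\infty} |C_X(\tau)|\,d\tau < +\infty$.

(c) $\lim_{\tau \to \infty} R_X(\tau) = 0$.

(d) $\int_{-\infty}^{\infty} |R_X(\tau)|\,d\tau < +\infty$.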

Proof. By the definition of m.s. convergence, X is mean ergodic if and only if

$$\lim_{t \to \infty} E\left[ \left( \frac{1}{t} \int_0^t X_u\,du - \mu_X \right)^2 \right] = 0. \tag{7.32}$$

Since $E\left[ \frac{1}{t} \int_0^t X_u\,du \right] = \frac{1}{t} \int_0^t \mu_X\,du = \mu_X$, (7.32) is the same as $\mathrm{Var}\left( \frac{1}{t} \int_0^t X_u\,du \right) \to 0$ as $t \to \infty$. By
the properties of m.s. integrals,

$$\begin{aligned}
\mathrm{Var}\left( \frac{1}{t} \int_0^t X_u\,du \right)
 &= \mathrm{Cov}\left( \frac{1}{t} \int_0^t X_u\,du,\; \frac{1}{t} \int_0^t X_v\,dv \right) \\
 &= \frac{1}{t^2} \int_0^t \int_0^t C_X(u - v)\,du\,dv && (7.33) \\
 &= \frac{1}{t^2} \int_0^t \int_{-v}^{t-v} C_X(\tau)\,d\tau\,dv && (7.34) \\
 &= \frac{1}{t^2} \left[ \int_0^t \int_0^{t-\tau} C_X(\tau)\,dv\,d\tau + \int_{-t}^{0} \int_{-\tau}^{t} C_X(\tau)\,dv\,d\tau \right] && (7.35) \\
 &= \frac{1}{t} \int_{-t}^{t} \left( \frac{t - |\tau|}{t} \right) C_X(\tau)\,d\tau \\
 &= \frac{2}{t} \int_0^t \left( \frac{t - \tau}{t} \right) C_X(\tau)\,d\tau,
\end{aligned}$$

where for v fixed the variable $\tau = u - v$ was introduced, and we use the fact that in both (7.34)
and (7.35), the pair $(v, \tau)$ ranges over the region pictured in Figure 7.4. This establishes the first
statement of the proposition.

Figure 7.4: Region of integration for (7.34) and (7.35).

For the remainder of the proof, it is important to keep in mind that the integral in (7.33) is
simply the average of $C_X(u - v)$ over the square $[0,t] \times [0,t]$. The function $C_X(u - v)$ is equal to
$C_X(0)$ along the diagonal of the square, and the magnitude of the function is bounded by $C_X(0)$
everywhere in the square. Thus, if $C_X(u - v)$ is small for $u - v$ larger than some constant, then for t
large, the average of $C_X(u - v)$ over the square will be small. The integral in (7.31) is equivalent
to the integral in (7.33), and both can be viewed as a weighted average of $C_X(\tau)$, with a triangular
weighting function.

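It remains to prove the assertions regarding (a)-(d). Suppose $C_X(\tau) \to c$ as $\tau \to \infty$. We claim
the left side of (7.31) is equal to c. Indeed, given $\epsilon > 0$ there exists $L > 0$ so that $|C_X(\tau) - c| \leq \epsilon$
whenever $\tau \geq L$. For $0 \leq \tau \leq L$ we can use the Schwarz inequality to bound $C_X(\tau)$, namely
$|C_X(\tau)| \leq C_X(0)$. Therefore for $t \geq L$,

$$\begin{aligned}
\left| \frac{2}{t} \int_0^t \left( \frac{t - \tau}{t} \right) C_X(\tau)\,d\tau - c \right|
 &= \left| \frac{2}{t} \int_0^t \left( \frac{t - \tau}{t} \right) (C_X(\tau) - c)\,d\tau \right| \\
 &\leq \frac{2}{t} \int_0^t \left( \frac{t - \tau}{t} \right) |C_X(\tau) - c|\,d\tau \\
 &\leq \frac{2}{t} \int_0^L (C_X(0) + |c|)\,d\tau + \frac{2\epsilon}{t} \int_L^t \frac{t - \tau}{t}\,d\tau \\
 &\leq \frac{2L(C_X(0) + |c|)}{t} + \frac{2\epsilon}{t} \int_0^t \frac{t - \tau}{t}\,d\tau = \frac{2L(C_X(0) + |c|)}{t} + \epsilon \\
 &\leq 2\epsilon \quad \text{for t large enough.}
\end{aligned}$$

Thus the left side of (7.31) is equal to c, as claimed. Hence if $\lim_{\tau \to \infty} C_X(\tau) = c$, (7.31) holds if
and only if $c = 0$. It remains to prove that (b), (c) and (d) each imply (7.31).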
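Suppose condition (b) holds. Then

$$\left| \frac{2}{t} \int_0^t \left( \frac{t - \tau}{t} \right) C_X(\tau)\,d\tau \right| \leq \frac{2}{t} \int_0^t |C_X(\tau)|\,d\tau \leq \frac{1}{t} \int_{-\infty}^{\infty} |C_X(\tau)|\,d\tau \to 0 \quad \text{as } t \to \infty$$

so that (7.31) holds.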

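Suppose either condition (c) or condition (d) holds. By the same arguments applied to $C_X$ for
parts (a) and (b), it follows that

$$\frac{2}{t} \int_0^t \left( \frac{t - \tau}{t} \right) R_X(\tau)\,d\tau \to 0 \quad \text{as } t \to \infty.$$

Since the integral in (7.31) is the variance of a random variable, it is nonnegative. Also, the integral
is a weighted average of $C_X(\tau)$, and $C_X(\tau) = R_X(\tau) - \mu_X^2$. Therefore,

$$0 \leq \frac{2}{t} \int_0^t \left( \frac{t - \tau}{t} \right) C_X(\tau)\,d\tau = -\mu_X^2 + \frac{2}{t} \int_0^t \left( \frac{t - \tau}{t} \right) R_X(\tau)\,d\tau \to -\mu_X^2 \quad \text{as } t \to \infty.$$

A nonnegative quantity can converge to $-\mu_X^2 \leq 0$ only if the limit is zero. Thus, (7.31) holds,
so that X is mean ergodic in the m.s. sense. In addition, we see that conditions
(c) and (d) also each imply that $\mu_X = 0$. □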
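
The criterion for mean ergodicity, $\frac{2}{t}\int_0^t \frac{t-\tau}{t} C_X(\tau)\,d\tau \to 0$, is easy to evaluate numerically for a given covariance function. The sketch below does so with a midpoint rule for two illustrative choices that are not taken from the text: $C_X(\tau) = e^{-|\tau|}$, which tends to zero, and $C_X \equiv 1$, which does not. The function name and grid size are assumptions for this illustration.

```python
import math


def ergodicity_integral(C_X, t, n=20000):
    """Midpoint-rule value of (2/t) * int_0^t ((t - tau)/t) C_X(tau) d tau."""
    d = t / n
    total = 0.0
    for k in range(n):
        tau = (k + 0.5) * d
        total += (t - tau) / t * C_X(tau)
    return 2.0 / t * total * d


def exponential_cov(tau):
    """Illustrative covariance C_X(tau) = exp(-|tau|), which decays to zero."""
    return math.exp(-abs(tau))


def constant_cov(tau):
    """Illustrative covariance C_X(tau) = 1, which does not decay."""
    return 1.0


for t in (1.0, 10.0, 100.0, 1000.0):
    print(t, ergodicity_integral(exponential_cov, t), ergodicity_integral(constant_cov, t))
```

The first column of values decays roughly like $2/t$, while the second stays at 1, which matches the statement that when $C_X(\tau)$ has a limit c, the criterion holds exactly when $c = 0$.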



Example 7.4.2 Let $f_c$ be a nonzero constant, let $\Theta$ be a random variable such that $\cos(\Theta)$, $\sin(\Theta)$,
$\cos(2\Theta)$, and $\sin(2\Theta)$ have mean zero, and let A be a random variable independent of $\Theta$ such that
$E[A^2] < +\infty$. Let $X = (X_t : t \in \mathbb{R})$ be defined by $X_t = A\cos(2\pi f_c t + \Theta)$. Then X is WSS with
$\mu_X = 0$ and $R_X(\tau) = C_X(\tau) = \frac{E[A^2]\cos(2\pi f_c \tau)}{2}$. Condition (7.31) is satisfied, so X is mean ergodic.

Mean ergodicity can also be directly verified:

$$\left| \frac{1}{t} \int_0^t X_u\,du \right| = \left| \frac{A}{t} \int_0^t \cos(2\pi f_c u + \Theta)\,du \right| = \left| \frac{A(\sin(2\pi f_c t + \Theta) - \sin(\Theta))}{2\pi f_c t} \right| \leq \frac{|A|}{\pi |f_c| t} \to 0 \quad \text{m.s. as } t \to \infty.$$
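
The direct verification for the random-phase sinusoid rests on the closed form of the time average and on the bound $|A|/(\pi |f_c| t)$, which follows from $|\sin a - \sin b| \leq 2$. The sketch below evaluates both for a few fixed values of A and $\Theta$; the function names and the sample values are assumptions for this illustration.

```python
import math


def time_average(A, theta, f_c, t):
    """(1/t) * int_0^t A cos(2 pi f_c u + theta) du, evaluated in closed form."""
    w = 2.0 * math.pi * f_c
    return A * (math.sin(w * t + theta) - math.sin(theta)) / (w * t)


def bound(A, f_c, t):
    """Upper bound |A| / (pi |f_c| t) on the magnitude of the time average."""
    return abs(A) / (math.pi * abs(f_c) * t)


A, theta, f_c = 2.0, 0.7, 5.0
for t in (0.13, 1.0, 10.0, 1000.0):
    print(t, abs(time_average(A, theta, f_c, t)), bound(A, f_c, t))
```

Because the bound holds for every outcome of A and $\Theta$ and $E[A^2]$ is finite, the time average converges to zero in the m.s. sense.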



Example 7.4.3 (Composite binary source) A student has two biased coins, each with a zero on one
side and a one on the other. Whenever the first coin is flipped the outcome is a one with probability
$\frac{3}{4}$. Whenever the second coin is flipped the outcome is a one with probability $\frac{1}{4}$. Consider a random
process $(W_k : k \in \mathbb{Z})$ formed as follows. First, the student selects one of the coins, each coin being
selected with equal probability. Then the selected coin is used to generate the $W_k$'s; the other
coin is not used at all.

This scenario can be modelled as in Figure 7.5, using the following random variables:

Figure 7.5: A composite binary source.

• $(U_k : k \in \mathbb{Z})$ are independent $Be(\frac{3}{4})$ random variables

• $(V_k : k \in \mathbb{Z})$ are independent $Be(\frac{1}{4})$ random variables

• S is a $Be(\frac{1}{2})$ random variable

• The above random variables are all independent

• $W_k = (1 - S)U_k + S V_k$.

The variable S can be thought of as a switch state. Value $S = 0$ corresponds to using the coin with
probability of heads equal to $\frac{3}{4}$ for each flip.

Clearly W is stationary, and hence also WSS. Is W mean ergodic? One approach to answering
this is the direct one. Clearly

$$\mu_W = E[W_k] = E[W_k \mid S = 0]P[S = 0] + E[W_k \mid S = 1]P[S = 1] = \frac{3}{4} \cdot \frac{1}{2} + \frac{1}{4} \cdot \frac{1}{2} = \frac{1}{2}.$$

So the question is whether

$$\frac{1}{n} \sum_{k=1}^{n} W_k \to \frac{1}{2} \quad \text{m.s.}$$

But by the strong law of large numbers

$$\begin{aligned}
\frac{1}{n} \sum_{k=1}^{n} W_k &= \frac{1}{n} \sum_{k=1}^{n} \left( (1 - S)U_k + S V_k \right) \\
 &= (1 - S) \left( \frac{1}{n} \sum_{k=1}^{n} U_k \right) + S \left( \frac{1}{n} \sum_{k=1}^{n} V_k \right) \\
 &\xrightarrow{\text{m.s.}} (1 - S)\frac{3}{4} + S\frac{1}{4} = \frac{3}{4} - \frac{S}{2}.
\end{aligned}$$

Thus, the limit is a random variable, rather than the constant $\frac{1}{2}$. Intuitively, the process W has
such strong memory due to the switch mechanism that even averaging over long time intervals does
not diminish the randomness due to the switch.

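Another way to show that W is not mean ergodic is to find the covariance function $C_W$ and use
the necessary and sufficient condition (7.31) for mean ergodicity. Note that for k fixed, $W_k^2 = W_k$
with probability one, so $E[W_k^2] = \frac{1}{2}$. If $k \neq l$, then

$$\begin{aligned}
E[W_k W_l] &= E[W_k W_l \mid S = 0]P[S = 0] + E[W_k W_l \mid S = 1]P[S = 1] \\
 &= E[U_k U_l]\frac{1}{2} + E[V_k V_l]\frac{1}{2} \\
 &= E[U_k]E[U_l]\frac{1}{2} + E[V_k]E[V_l]\frac{1}{2} \\
 &= \left( \frac{3}{4} \right)^2 \frac{1}{2} + \left( \frac{1}{4} \right)^2 \frac{1}{2} = \frac{5}{16}.
\end{aligned}$$

Therefore,

$$C_W(n) = \begin{cases} \frac{1}{4} & \text{if } n = 0 \\ \frac{1}{16} & \text{if } n \neq 0. \end{cases}$$

Since $\lim_{n \to \infty} C_W(n)$ exists and is not zero, W is not mean ergodic.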
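
The failure of mean ergodicity for the composite binary source shows up clearly in simulation: the long-run average settles near $3/4$ or $1/4$ depending on the switch, not near the mean $1/2$. The sketch below simulates the source (with $S = 0$ selecting the coin that shows a one with probability $3/4$) and also recomputes the covariance values exactly with rational arithmetic; the function names, run length and seeds are assumptions for this illustration.

```python
import random
from fractions import Fraction


def time_average_W(n, seed):
    """Simulate W_1..W_n for one draw of the switch S; return (S, time average)."""
    rng = random.Random(seed)
    S = 1 if rng.random() < 0.5 else 0
    p_one = 0.25 if S == 1 else 0.75      # S = 0 selects the coin with P[one] = 3/4
    ones = sum(1 for _ in range(n) if rng.random() < p_one)
    return S, ones / n


def covariance_W(n):
    """Exact C_W(n) = E[W_k W_{k+n}] - (1/2)^2 from the conditional computation."""
    mean = Fraction(1, 2)
    if n == 0:
        second_moment = Fraction(1, 2)    # W_k^2 = W_k
    else:
        second_moment = Fraction(1, 2) * Fraction(3, 4) ** 2 + Fraction(1, 2) * Fraction(1, 4) ** 2
    return second_moment - mean ** 2


for seed in range(5):
    print(time_average_W(20000, seed))
print(covariance_W(0), covariance_W(1))
```

Across seeds the time averages cluster around the two values $3/4 - S/2$, and the covariance does not decay with the lag, consistent with the conclusion of the example.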

In many applications, we are interested in averages of functions that depend on multiple random
variables. We discuss this topic for a discrete time stationary random process, $(X_n : n \in \mathbb{Z})$. Let h
be a bounded, Borel measurable function on $\mathbb{R}^k$ for some k. What time average would we expect
to be a good approximation to the statistical average $E[h(X_1, \ldots, X_k)]$? A natural choice is

$$\frac{1}{n} \sum_{j=1}^{n} h(X_j, X_{j+1}, \ldots, X_{j+k-1}).$$

We define a stationary random process $(X_n : n \in \mathbb{Z})$ to be ergodic if

$$\lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} h(X_j, \ldots, X_{j+k-1}) = E[h(X_1, \ldots, X_k)]$$

for every $k \geq 1$ and for every bounded Borel measurable function h on $\mathbb{R}^k$, where the limit is taken
in any of the three senses a.s., p. or m.s.³ An interpretation of the definition is that if X is ergodic
then all of its finite dimensional distributions are determined as time averages.

As an example, suppose

$$h(x_1, x_2) = \begin{cases} 1 & \text{if } x_1 > 0 \geq x_2 \\ 0 & \text{else.} \end{cases}$$

Then $h(X_1, X_2)$ is one if the process $(X_k)$ makes a "down crossing" of level 0 between times one
and two. If X is ergodic then with probability 1,

$$\lim_{n \to \infty} \frac{1}{n} (\text{number of down crossings between times 1 and } n+1) = P[X_1 > 0 \geq X_2]. \tag{7.36}$$

Equation (7.36) relates quantities that are quite different in nature. The left hand side of (7.36)
is the long time-average downcrossing rate, whereas the right hand side of (7.36) involves only the
joint statistics of two consecutive values of the process.
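
The down-crossing relation can be illustrated with the simplest ergodic case, an iid sequence. The sketch below counts down crossings of level 0 in a simulated iid standard normal sequence, for which $P[X_1 > 0 \geq X_2] = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$ by independence and symmetry. The choice of distribution, the function names, the run length and the seed are assumptions for this illustration.

```python
import random


def h(x1, x2):
    """Indicator of a down crossing of level 0 between two consecutive times."""
    return 1 if x1 > 0 >= x2 else 0


def downcrossing_rate(xs):
    """Time-average down-crossing rate over the consecutive pairs of the sequence xs."""
    n = len(xs) - 1
    return sum(h(xs[j], xs[j + 1]) for j in range(n)) / n


rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(100001)]
print(downcrossing_rate(xs))   # should be close to 1/4 for this iid example
```

For a dependent stationary process the right-hand side would be computed from the joint distribution of two consecutive values rather than from this product of marginals.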

Ergodicity is a strong property. Two types of ergodic random processes are the following:

• a process $X = (X_k)$ such that the $X_k$'s are iid.

• a stationary Gaussian random process X such that $\lim_{n \to \infty} R_X(n) = 0$ or $\lim_{n \to \infty} C_X(n) = 0$.

7.5 Complexification, Part I 

In some application areas, primarily in connection with spectral analysis as we shall see, complex 
valued random variables naturally arise. Vectors and matrices over C are reviewed in the appendix. 
A complex random variable X = U + jV can be thought of as essentially a two dimensional random 
variable with real coordinates U and V. Similarly, a random complex n-dimensional vector X can be 
written as X = U + jV, where U and V are each n-dimensional real vectors. As far as distributions 
are concerned, a random vector in n-dimensional complex space C n is equivalent to a random vector 
with 2n real dimensions. For example, if the 2n real variables in U and V are jointly continuous,
then X is a continuous type complex random vector and its density is given by a function $f_X(x)$
for $x \in \mathbb{C}^n$. The density $f_X$ is related to the joint density of U and V by $f_X(u + jv) = f_{UV}(u,v)$
for all $u, v \in \mathbb{R}^n$.

$^3$ The mathematics literature uses a different definition of ergodicity for stationary processes, which is equivalent.
There are also definitions of ergodicity that do not require stationarity.

As far as moments are concerned, all the second order analysis covered in the notes up to this
point can be easily modified to hold for complex random variables, simply by inserting complex
conjugates in appropriate places. To begin, if $X$ and $Y$ are complex random variables, we define
their correlation by $E[XY^*]$ and similarly their covariance as $E[(X - E[X])(Y - E[Y])^*]$. The
Schwarz inequality becomes $|E[XY^*]| \leq \sqrt{E[|X|^2]E[|Y|^2]}$ and its proof is essentially the same as
for real valued random variables. The cross correlation matrix for two complex random vectors
$X$ and $Y$ is given by $E[XY^*]$, and similarly the cross covariance matrix is given by $\mathrm{Cov}(X, Y) =
E[(X - E[X])(Y - E[Y])^*]$. As before, $\mathrm{Cov}(X) = \mathrm{Cov}(X, X)$. The various formulas for covariance
still apply. For example, if $A$ and $C$ are complex matrices and $b$ and $d$ are complex vectors, then
$\mathrm{Cov}(AX + b, CY + d) = A\,\mathrm{Cov}(X, Y)\,C^*$. Just as in the case of real valued random variables, a
matrix $K$ is a valid covariance matrix (in other words, there exists some random vector $X$ such that
$K = \mathrm{Cov}(X)$) if and only if $K$ is Hermitian symmetric and positive semidefinite.
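
The covariance rules for complex vectors can be checked numerically. The sketch below replaces expectations by sample averages (an assumption of this example; the identity $\mathrm{Cov}(AX + b, CY + d) = A\,\mathrm{Cov}(X,Y)\,C^*$ also holds exactly for sample covariances), and the helper names `cross_cov` and `crandn` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def crandn(*shape):
    """Complex array with independent N(0,1) real and imaginary parts."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

def cross_cov(x, y):
    """Sample version of Cov(X, Y) = E[(X - E[X])(Y - E[Y])^*]; columns are samples."""
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return xc @ yc.conj().T / x.shape[1]

n_samples = 1000
X = crandn(3, n_samples)
Y = crandn(2, 3) @ X + crandn(2, n_samples)      # Y is correlated with X
A, C = crandn(4, 3), crandn(5, 2)
b, d = crandn(4, 1), crandn(5, 1)

lhs = cross_cov(A @ X + b, C @ Y + d)            # Cov(AX + b, CY + d)
rhs = A @ cross_cov(X, Y) @ C.conj().T           # A Cov(X, Y) C^*
K = cross_cov(X, X)                              # Cov(X): Hermitian and PSD
```

Note that the Hermitian transpose `C.conj().T`, not the plain transpose, appears on the right; this is the only change from the real-valued formula.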

Complex valued random variables $X$ and $Y$ with finite second moments are said to be orthogonal
if $E[XY^*] = 0$, and with this definition the orthogonality principle holds for complex valued random
variables. If $X$ and $Y$ are complex random vectors, then again $E[X|Y]$ is the MMSE estimator of
$X$ given $Y$, and the covariance matrix of the error vector is given by $\mathrm{Cov}(X) - \mathrm{Cov}(E[X|Y])$. The
MMSE estimator for $X$ of the form $AY + b$ (i.e. the best linear estimator of $X$ based on $Y$) and
the covariance of the corresponding error vector are given just as for vectors made of real random
variables:

$$\widehat{E}[X|Y] = E[X] + \mathrm{Cov}(X, Y)\,\mathrm{Cov}(Y)^{-1}(Y - E[Y])$$
$$\mathrm{Cov}(X - \widehat{E}[X|Y]) = \mathrm{Cov}(X) - \mathrm{Cov}(X, Y)\,\mathrm{Cov}(Y)^{-1}\,\mathrm{Cov}(Y, X)$$
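
A minimal sketch of the best linear estimator for complex vectors follows, again with sample averages standing in for expectations (an assumption of this example). It checks the orthogonality principle (the error is uncorrelated with $Y$) and the error covariance formula. The data model and the helper names `crandn` and `cov` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)

def crandn(*shape):
    """Complex array with independent N(0,1) real and imaginary parts."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

def cov(x, y):
    """Sample cross covariance E[(X - E[X])(Y - E[Y])^*]; columns are samples."""
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return xc @ yc.conj().T / x.shape[1]

n = 5000
Y = crandn(2, n)
X = crandn(3, 2) @ Y + 0.5 * crandn(3, n) + (1 + 2j)   # hypothetical model for X

mx, my = X.mean(axis=1, keepdims=True), Y.mean(axis=1, keepdims=True)
G = cov(X, Y) @ np.linalg.inv(cov(Y, Y))     # Cov(X, Y) Cov(Y)^{-1}
X_hat = mx + G @ (Y - my)                    # best linear estimate of X from Y
err = X - X_hat

err_cov_formula = cov(X, X) - cov(X, Y) @ np.linalg.inv(cov(Y, Y)) @ cov(Y, X)
```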

By definition, a sequence $X_1, X_2, \ldots$ of complex valued random variables converges in the m.s.
sense to a random variable $X$ if $E[|X_n|^2] < \infty$ for all $n$ and if $\lim_{n\to\infty} E[|X_n - X|^2] = 0$. The
various Cauchy criteria still hold with minor modification. A sequence with $E[|X_n|^2] < \infty$ for all
$n$ is a Cauchy sequence in the m.s. sense if $\lim_{m,n\to\infty} E[|X_n - X_m|^2] = 0$. As before, a sequence
converges in the m.s. sense if and only if it is a Cauchy sequence. In addition, a sequence $X_1, X_2, \ldots$
of complex valued random variables with $E[|X_n|^2] < \infty$ for all $n$ converges in the m.s. sense if
and only if $\lim_{m,n\to\infty} E[X_m X_n^*]$ exists and is a finite constant $c$. If the m.s. limit exists, then the
limiting random variable $X$ satisfies $E[|X|^2] = c$.

Let $X = (X_t : t \in \mathbf{T})$ be a complex random process. We can write $X_t = U_t + jV_t$ where $U$ and
$V$ are each real valued random processes. The process $X$ is defined to be a second order process
if $E[|X_t|^2] < \infty$ for all $t$. Since $|X_t|^2 = U_t^2 + V_t^2$ for each $t$, $X$ being a second order process is
equivalent to both $U$ and $V$ being second order processes. The correlation function of a second
order complex random process $X$ is defined by $R_X(s,t) = E[X_s X_t^*]$. The covariance function is
given by $C_X(s, t) = \mathrm{Cov}(X_s, X_t)$ where the definition of Cov for complex random variables is used.
The definitions and results given for m.s. continuity, m.s. differentiation, and m.s. integration all
carry over to the case of complex processes, because they are based on the use of the Cauchy criteria
for m.s. convergence which also carries over. For example, a complex valued random process is m.s.
continuous if and only if its correlation function $R_X$ is continuous. Similarly the cross correlation
function for two second order random processes $X$ and $Y$ is defined by $R_{XY}(s, t) = E[X_s Y_t^*]$. Note
that $R_{XY}(s,t) = R_{YX}^*(t,s)$.

Let $X = (X_t : t \in \mathbf{T})$ be a complex random process such that $\mathbf{T}$ is either the real line or
the set of integers, and write $X_t = U_t + jV_t$ where $U$ and $V$ are each real valued random processes.
By definition, $X$ is stationary if and only if for any $t_1, \ldots, t_n \in \mathbf{T}$, the joint distribution of
$(X_{t_1+s}, \ldots, X_{t_n+s})$ is the same for all $s \in \mathbf{T}$. Equivalently, $X$ is stationary if and only if $U$ and $V$
are jointly stationary. The process $X$ is defined to be WSS if $X$ is a second order process such that
$E[X_t]$ does not depend on $t$, and $R_X(s, t)$ is a function of $s - t$ alone. If $X$ is WSS we use $R_X(\tau)$ to
denote $R_X(s, t)$, where $\tau = s - t$. A pair of complex-valued random processes $X$ and $Y$ are defined
to be jointly WSS if both $X$ and $Y$ are WSS and if the cross correlation function $R_{XY}(s,t)$ is a
function of $s - t$. If $X$ and $Y$ are jointly WSS then $R_{XY}(-\tau) = R_{YX}^*(\tau)$.
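
The last identity follows in one line from the definitions. As a short worked derivation (with $s$ an arbitrary time, so that the pair of times $s$ and $s + \tau$ has difference $-\tau$):

$$R_{XY}(-\tau) = E[X_s Y_{s+\tau}^*] = \left(E[Y_{s+\tau} X_s^*]\right)^* = R_{YX}^*(\tau).$$

The middle step uses only $(zw^*)^* = wz^*$ for complex numbers and the fact that conjugation commutes with expectation.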

In summary, everything we've discussed in this section regarding complex random variables, 
vectors, and processes can be considered a simple matter of notation. One simply needs to add 
lines to indicate complex conjugates, to use $|X|^2$ instead of $X^2$, and to use a star "*" for Hermitian
transpose in place of "T" for transpose. We shall begin using the notation at this point, and return 
to a discussion of the topic of complex valued random processes in a later section. In particular, 
we will examine complex normal random vectors and their densities, and we shall see that there is 
somewhat more to complexification than just notation. 

7.6 The Karhunen-Loeve expansion 

We've seen that under a change of coordinates, an $n$-dimensional random vector $X$ is transformed
into a vector $Y = U^*X$ such that the coordinates of $Y$ are orthogonal random variables. Here
$U$ is the unitary matrix such that $E[XX^*] = U\Lambda U^*$. The columns of $U$ are eigenvectors of the
Hermitian symmetric matrix $E[XX^*]$ and the corresponding nonnegative eigenvalues of $E[XX^*]$
comprise the diagonal of the diagonal matrix $\Lambda$. The columns of $U$ form an orthonormal basis for
$\mathbb{C}^n$. The Karhunen-Loeve expansion gives a similar change of coordinates for a random process on
a finite interval, using an orthonormal basis of functions instead of an orthonormal basis of vectors.
Fix a finite interval $[a, b]$. The $L^2$ norm of a real or complex valued function $f$ on the interval
$[a, b]$ is defined by

$$\|f\| = \sqrt{\int_a^b |f(t)|^2\,dt}.$$

We write $L^2[a, b]$ for the set of all functions on $[a, b]$ which have finite $L^2$ norm. The inner product
of two functions $f$ and $g$ in $L^2[a, b]$ is defined by

$$\langle f, g\rangle = \int_a^b f(t)g^*(t)\,dt.$$

The functions $f$ and $g$ are said to be orthogonal if $\langle f, g\rangle = 0$. Note that $\|f\| = \sqrt{\langle f,f\rangle}$ and the
Schwarz inequality holds: $|\langle f,g\rangle| \leq \|f\| \cdot \|g\|$. A finite or infinite set of functions $(\phi_n)$ in $L^2[a,b]$
is said to be an orthonormal system if the functions in the set are mutually orthogonal and have
norm one, or in other words, $\langle\phi_i, \phi_j\rangle = I_{\{i=j\}}$ for all $i$ and $j$.

In many applications it is useful to use representations of the form

$$f(t) = \sum_{n=1}^{N} c_n\phi_n(t), \tag{7.37}$$

for some orthonormal system $\phi_1, \ldots, \phi_N$. In such a case, we think of $(c_1, \ldots, c_N)$ as the coordinates
of $f$ relative to the orthonormal system $(\phi_n)$, and we might write $f \leftrightarrow (c_1, \ldots, c_N)$. For example,
transmitted signals in many digital communication systems have this form, where the coordinate
vector $(c_1, \ldots, c_N)$ represents a data symbol. The geometry of the space of all functions $f$ of
the form (7.37) for the fixed orthonormal system $\phi_1, \ldots, \phi_N$ is equivalent to the geometry of the
coordinate vectors. For example, if $g$ has a similar representation,

$$g(t) = \sum_{n=1}^{N} d_n\phi_n(t),$$

or equivalently $g \leftrightarrow (d_1, \ldots, d_N)$, then $f + g \leftrightarrow (c_1, \ldots, c_N) + (d_1, \ldots, d_N)$ and

$$\begin{aligned}
\langle f, g\rangle &= \int_a^b \left(\sum_{m=1}^{N} c_m\phi_m(t)\right)\left(\sum_{n=1}^{N} d_n^*\phi_n^*(t)\right) dt \\
&= \sum_{m=1}^{N}\sum_{n=1}^{N} c_m d_n^* \int_a^b \phi_m(t)\phi_n^*(t)\,dt \\
&= \sum_{m=1}^{N}\sum_{n=1}^{N} c_m d_n^* \langle\phi_m, \phi_n\rangle \\
&= \sum_{m=1}^{N} c_m d_m^*.
\end{aligned} \tag{7.38}$$

That is, the inner product of the functions, $\langle f, g\rangle$, is equal to the inner product of their coordinate
vectors. Note that for $1 \leq n \leq N$, $\phi_n \leftrightarrow (0, \ldots, 0, 1, 0, \ldots, 0)$, such that the one is in the $n$th
position. If $f \leftrightarrow (c_1, \ldots, c_N)$, then the $n$th coordinate of $f$ is the inner product of $f$ and $\phi_n$:

$$\langle f, \phi_n\rangle = \int_a^b \left(\sum_{m=1}^{N} c_m\phi_m(t)\right)\phi_n^*(t)\,dt = \sum_{m=1}^{N} c_m\langle\phi_m, \phi_n\rangle = c_n.$$

Another way to derive that $\langle f, \phi_n\rangle = c_n$ is to note that $f \leftrightarrow (c_1, \ldots, c_N)$ and $\phi_n \leftrightarrow (0, \ldots, 0, 1, 0, \ldots, 0)$,
so $\langle f, \phi_n\rangle$ is the inner product of $(c_1, \ldots, c_N)$ and $(0, \ldots, 0, 1, 0, \ldots, 0)$, or $c_n$. Thus, the coordinate
vector for $f$ is given by $f \leftrightarrow (\langle f, \phi_1\rangle, \ldots, \langle f, \phi_N\rangle)$.
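
The correspondence between functions and coordinate vectors can be tried out numerically. The sketch below is an illustration under stated assumptions: the interval is $[0, T]$, integrals are approximated by midpoint Riemann sums, and the orthonormal system is taken to be $\sqrt{2/T}\sin(n\pi t/T)$, $1 \leq n \leq N$, which is a choice made for this example rather than one used in the notes.

```python
import numpy as np

T, M, N = 1.0, 2000, 4
dt = T / M
t = (np.arange(M) + 0.5) * dt                     # midpoint grid on [0, T]

def inner(f, g):
    """Riemann-sum approximation of <f, g> = int_0^T f(t) g*(t) dt."""
    return np.sum(f * np.conj(g)) * dt

# Assumed orthonormal system: phi_n(t) = sqrt(2/T) sin(n pi t / T)
phi = np.array([np.sqrt(2 / T) * np.sin(n * np.pi * t / T) for n in range(1, N + 1)])

c = np.array([1.0 + 2.0j, -0.5, 0.3j, 2.0])       # coordinates of f
d = np.array([0.7, 1.0 - 1.0j, -2.0, 0.1j])       # coordinates of g
f = c @ phi                                       # f(t) = sum_n c_n phi_n(t)
g = d @ phi

gram = np.array([[inner(pm, pn) for pn in phi] for pm in phi])   # <phi_m, phi_n>
coords_f = np.array([inner(f, p) for p in phi])                  # <f, phi_n>
ip_functions = inner(f, g)                                       # left side of (7.38)
ip_coords = np.sum(c * np.conj(d))                               # right side of (7.38)
```

Here `gram` should be close to the identity matrix, `coords_f` should recover `c`, and the two inner products should agree, mirroring (7.38).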

The dimension of the space $L^2[a,b]$ is infinite, meaning that there are orthonormal systems
$(\phi_n : n \geq 1)$ with infinitely many functions. For such a system, a function $f$ can have the
representation

$$f(t) = \sum_{n=1}^{\infty} c_n\phi_n(t). \tag{7.39}$$

In many instances encountered in practice, the sum (7.39) converges for each $t$, but in general what
is meant is that the convergence is in the sense of the $L^2[a, b]$ norm:

$$\lim_{N\to\infty} \int_a^b \left| f(t) - \sum_{n=1}^{N} c_n\phi_n(t) \right|^2 dt = 0,$$

or equivalently,

$$\lim_{N\to\infty} \left\| f - \sum_{n=1}^{N} c_n\phi_n \right\| = 0.$$

The span of a set of functions $\phi_1, \ldots, \phi_N$ is the set of all functions of the form $a_1\phi_1(t) +
\cdots + a_N\phi_N(t)$. If the functions $\phi_1, \ldots, \phi_N$ form an orthonormal system and if $f \in L^2[a, b]$, then the
function $f^\sharp$ in the span of $\phi_1, \ldots, \phi_N$ that minimizes $\|f - f^\sharp\|$ is given by $f^\sharp(t) = \sum_{n=1}^{N} \langle f, \phi_n\rangle\phi_n(t)$.
In fact, it is easy to check that $f - f^\sharp$ is orthogonal to $\phi_n$ for all $n$, implying that for any complex
numbers $a_1, \ldots, a_N$,

$$\left\| f - \sum_{n=1}^{N} a_n\phi_n \right\|^2 = \|f - f^\sharp\|^2 + \sum_{n=1}^{N} |\langle f^\sharp, \phi_n\rangle - a_n|^2.$$

Thus, the closest approximation is indeed given by $a_n = \langle f, \phi_n\rangle$. That is, $f^\sharp$ given by $f^\sharp(t) =
\sum_{n=1}^{N} \langle f, \phi_n\rangle\phi_n(t)$ is the projection of $f$ onto the span of the $\phi$'s. Furthermore,

$$\|f - f^\sharp\|^2 = \|f\|^2 - \|f^\sharp\|^2 = \|f\|^2 - \sum_{n=1}^{N} |\langle f, \phi_n\rangle|^2. \tag{7.40}$$

The above reasoning is analogous to that in Proposition 3.2.4.
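
The projection property and (7.40) can be illustrated numerically. This is a sketch under assumptions made for the example: the interval is $[0, T]$, integrals are midpoint Riemann sums, the orthonormal system is $\sqrt{2/T}\sin(n\pi t/T)$, and the test function and the perturbed coefficients are arbitrary.

```python
import numpy as np

T, M, N = 1.0, 2000, 5
dt = T / M
t = (np.arange(M) + 0.5) * dt                       # midpoint grid on [0, T]

def inner(f, g):
    """Riemann-sum approximation of <f, g> = int_0^T f(t) g*(t) dt."""
    return np.sum(f * np.conj(g)) * dt

# Assumed orthonormal system: phi_n(t) = sqrt(2/T) sin(n pi t / T)
phi = np.array([np.sqrt(2 / T) * np.sin(n * np.pi * t / T) for n in range(1, N + 1)])

f = t * np.exp(2j * np.pi * t)                      # arbitrary complex test function
coef = np.array([inner(f, p) for p in phi])         # <f, phi_n>
f_sharp = coef @ phi                                # projection of f onto span(phi)

err_proj = inner(f - f_sharp, f - f_sharp).real     # ||f - f_sharp||^2
rhs_740 = inner(f, f).real - np.sum(np.abs(coef) ** 2)   # right side of (7.40)

a = coef + 0.1                                      # some other coefficient choice
err_other = inner(f - a @ phi, f - a @ phi).real    # ||f - sum a_n phi_n||^2
```

Any coefficient vector other than `coef` increases the squared error by exactly the squared distance between the two coefficient vectors, which is the content of the displayed identity preceding (7.40).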

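An orthonormal system $(\phi_n)$ is said to be an orthonormal basis for $L^2[a, b]$, if any $f \in L^2[a,b]$
can be represented as in (7.39). If $(\phi_n)$ is an orthonormal basis then for any $f, g \in L^2[a, b]$, (7.38)
still holds with $N$ replaced by $\infty$ and is known as Parseval's relation:

$$\langle f, g\rangle = \sum_{n=1}^{\infty} \langle f, \phi_n\rangle\langle g, \phi_n\rangle^*.$$

In particular,

$$\|f\|^2 = \sum_{n=1}^{\infty} |\langle f, \phi_n\rangle|^2.$$

A commonly used orthonormal basis is the following (with $[a,b] = [0, T]$ for some $T > 0$):

$$\phi_1(t) = \frac{1}{\sqrt{T}}, \quad \phi_2(t) = \sqrt{\frac{2}{T}}\cos\left(\frac{2\pi t}{T}\right), \quad \phi_3(t) = \sqrt{\frac{2}{T}}\sin\left(\frac{2\pi t}{T}\right),$$

$$\phi_{2k}(t) = \sqrt{\frac{2}{T}}\cos\left(\frac{2\pi k t}{T}\right), \quad \phi_{2k+1}(t) = \sqrt{\frac{2}{T}}\sin\left(\frac{2\pi k t}{T}\right) \quad \text{for } k \geq 1. \tag{7.41}$$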

Next, consider what happens if $f$ is replaced by a random process $X = (X_t : a \leq t \leq b)$. Suppose
$(\phi_n : 1 \leq n \leq N)$ is an orthonormal system consisting of continuous functions, with $N \leq \infty$. The
system does not have to be a basis for $L^2[a, b]$, but if it is then there are infinitely many functions
in the system. Suppose that $X$ is m.s. continuous, or equivalently, that $R_X$ is continuous as a
function on $[a, b] \times [a,b]$. In particular, $R_X$ is bounded. Then $E\left[\int_a^b |X_t|^2 dt\right] = \int_a^b R_X(t,t)\,dt < \infty$,
so that $\int_a^b |X_t|^2 dt$ is finite with probability one. Suppose that $X$ can be represented as

$$X_t = \sum_{n=1}^{N} C_n\phi_n(t). \tag{7.42}$$

Such a representation exists if $(\phi_n)$ is a basis for $L^2[a, b]$, but some random processes have the form
(7.42) even if $N$ is finite or if $N$ is infinite but the system is not a basis. The representation (7.42)
reduces the description of the continuous-time random process to the description of the coefficients,
$(C_n)$. This representation of $X$ is much easier to work with if the coordinate random variables are
orthogonal.

Definition 7.6.1 A Karhunen-Loeve (KL) expansion for a random process $X = (X_t : a \leq t \leq b)$
is a representation of the form (7.42) with $N \leq \infty$ such that:

(1) the functions $(\phi_n)$ are orthonormal: $\langle\phi_m, \phi_n\rangle = I_{\{m=n\}}$, and

(2) the coordinate random variables $C_n$ are mutually orthogonal: $E[C_m C_n^*] = 0$ for $m \neq n$.



Example 7.6.2 Let $X_t = A$ for $0 \leq t \leq T$, where $A$ is a random variable with $0 < E[A^2] < \infty$.
Then $X$ has the form in (7.42) for $[a,b] = [0,T]$, $N = 1$, $C_1 = A\sqrt{T}$, and $\phi_1(t) = I_{\{0 \leq t \leq T\}}/\sqrt{T}$. This is
trivially a KL expansion, with only one term.



Example 7.6.3 Let $X_t = A\cos(2\pi t/T + \Theta)$ for $0 \leq t \leq T$, where $A$ is a real-valued random variable
with $0 < E[A^2] < \infty$, and $\Theta$ is a random variable uniformly distributed on $[0, 2\pi]$ and independent
of $A$. By the cosine angle addition formula, $X_t = A\cos(\Theta)\cos(2\pi t/T) - A\sin(\Theta)\sin(2\pi t/T)$. Then
$X$ has the form in (7.42) for $[a, b] = [0,T]$, $N = 2$,

$$C_1 = A\sqrt{\tfrac{T}{2}}\cos(\Theta), \quad C_2 = -A\sqrt{\tfrac{T}{2}}\sin(\Theta), \quad \phi_1(t) = \frac{\cos(2\pi t/T)}{\sqrt{T/2}}, \quad \phi_2(t) = \frac{\sin(2\pi t/T)}{\sqrt{T/2}}.$$

In particular, $\phi_1$ and $\phi_2$ form an orthonormal system with $N = 2$ elements. To check whether
this is a KL expansion, we see if $E[C_1C_2^*] = 0$. Since $E[C_1C_2^*] = -\frac{T}{2}E[A^2]E[\cos(\Theta)\sin(\Theta)] =
-\frac{T}{4}E[A^2]E[\sin(2\Theta)] = 0$, this is indeed a KL expansion, with two terms.
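
A Monte Carlo sketch of this example is given below. The amplitude distribution (uniform on $[1, 2]$), the value of $T$, the seed, and the sample size are assumptions chosen only for illustration; the example itself requires only that $A$ be real valued with $0 < E[A^2] < \infty$ and independent of the uniform phase.

```python
import numpy as np

rng = np.random.default_rng(3)
T, n_samples = 2.0, 200_000
A = rng.uniform(1.0, 2.0, n_samples)              # hypothetical amplitude distribution
Theta = rng.uniform(0.0, 2 * np.pi, n_samples)    # uniform phase, independent of A

# Coordinates and basis functions of the two-term expansion
C1 = A * np.sqrt(T / 2) * np.cos(Theta)
C2 = -A * np.sqrt(T / 2) * np.sin(Theta)
t = np.linspace(0.0, T, 9)
phi1 = np.cos(2 * np.pi * t / T) / np.sqrt(T / 2)
phi2 = np.sin(2 * np.pi * t / T) / np.sqrt(T / 2)

# Rebuild a few sample paths from the expansion and compare with the definition
X_direct = A[:5, None] * np.cos(2 * np.pi * t / T + Theta[:5, None])
X_expand = C1[:5, None] * phi1 + C2[:5, None] * phi2

corr = np.mean(C1 * C2)       # estimate of E[C1 C2*] (real case), should be near 0
power1 = np.mean(C1 ** 2)     # estimate of E[|C1|^2] = (T/4) E[A^2]
```

With these assumed parameters $E[A^2] = 7/3$, so `power1` should be near $7/6$, while `corr` should be near zero up to sampling error.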

An important property of Karhunen-Loeve (KL) expansions in practice is that they identify the
most accurate finite dimensional approximations of a random process, as described in the following
proposition. A random process $Z = (Z_t : a \leq t \leq b)$ is said to be $N$-dimensional if it has the form
$Z_t = \sum_{n=1}^{N} B_n\psi_n(t)$ for some $N$ random variables $B_1, \ldots, B_N$ and $N$ functions $\psi_1, \ldots, \psi_N$.

Proposition 7.6.4 Suppose $X$ has a Karhunen-Loeve (KL) expansion $X_t = \sum_{n=1}^{\infty} C_n\phi_n(t)$ (See
Definition 7.6.1). Let $\lambda_n = E[|C_n|^2]$ and suppose the terms are indexed so that $\lambda_1 \geq \lambda_2 \geq \cdots$. For
any finite $N \geq 1$, the $N$th partial sum, $X^{(N)}(t) = \sum_{n=1}^{N} C_n\phi_n(t)$, is a choice for $Z$ that minimizes
$E[\|X - Z\|^2]$ over all $N$-dimensional random processes $Z$.

Proof. Suppose $Z$ is a random linear combination of $N$ functions, $\psi_1, \ldots, \psi_N$. Without loss of
generality, assume that $\psi_1, \ldots, \psi_N$ is an orthonormal system. (If not, the Gram-Schmidt procedure
could be applied to get an orthonormal system of $N$ functions with the same span.) We first
identify the optimal choice of random coefficients for the $\psi$'s fixed, and then consider the optimal
choice of the $\psi$'s. For a given choice of $\psi$'s and a sample path of $X$, the $L^2$ norm $\|X - Z\|^2$
is minimized by projecting the sample path of $X$ onto the span of the $\psi$'s, which means taking
$Z_t = \sum_{j=1}^{N} \langle X, \psi_j\rangle\psi_j(t)$. That is, the sample path of $Z$ has the form of $f^\sharp$ above, if $f$ is the sample
path of $X$. This determines the coefficients to be used for a given choice of $\psi$'s; it remains to
determine the $\psi$'s. By (7.40), the (random) approximation error is

$$\|X - Z\|^2 = \|X\|^2 - \sum_{j=1}^{N} |\langle X, \psi_j\rangle|^2.$$

Using the KL expansion for $X$ yields

$$E[|\langle X, \psi_j\rangle|^2] = E\left[\left|\sum_{n=1}^{\infty} C_n\langle\phi_n, \psi_j\rangle\right|^2\right] = \sum_{n=1}^{\infty} \lambda_n|\langle\phi_n, \psi_j\rangle|^2.$$

Therefore,

$$E[\|X - Z\|^2] = E[\|X\|^2] - \sum_{n=1}^{\infty} \lambda_n b_n \tag{7.43}$$

where $b_n = \sum_{j=1}^{N} |\langle\phi_n, \psi_j\rangle|^2$. Note that $(b_n)$ satisfies the constraints $0 \leq b_n \leq 1$, and $\sum_{n=1}^{\infty} b_n \leq N$.
The right hand side of (7.43) is minimized over $(b_n)$ subject to these constraints by taking $b_n =
I_{\{1 \leq n \leq N\}}$. That can be achieved by taking $\psi_j = \phi_j$ for $1 \leq j \leq N$, in which case $\langle X, \psi_j\rangle = C_j$, and
$Z$ becomes $X^{(N)}$. $\Box$



Proposition 7.6.5 Suppose $X = (X_t : a \leq t \leq b)$ is m.s. continuous and $(\phi_n)$ is an orthonormal
system of continuous functions. If (7.42) holds for some random variables $(C_n)$, it is a KL expansion
(i.e., the coordinate random variables are orthogonal) if and only if the $\phi_n$'s are eigenfunctions
of $R_X$:

$$R_X\phi_n = \lambda_n\phi_n, \tag{7.44}$$

where for a function $\phi \in L^2[a,b]$, $R_X\phi$ denotes the function $(R_X\phi)(s) = \int_a^b R_X(s,t)\phi(t)\,dt$. In case
(7.42) is a KL expansion, the eigenvalues are given by $\lambda_n = E[|C_n|^2]$.

Proof. Suppose (7.42) holds. Then $C_n = \langle X, \phi_n\rangle = \int_a^b X_t\phi_n^*(t)\,dt$, so that

$$\begin{aligned}
E[C_m C_n^*] &= E[\langle X, \phi_m\rangle\langle X, \phi_n\rangle^*] \\
&= E\left[\left(\int_a^b X_s\phi_m^*(s)\,ds\right)\left(\int_a^b X_t\phi_n^*(t)\,dt\right)^*\right] \\
&= \int_a^b\int_a^b R_X(s, t)\phi_m^*(s)\phi_n(t)\,ds\,dt \\
&= \langle R_X\phi_n, \phi_m\rangle.
\end{aligned} \tag{7.45}$$

Now, if the $\phi_n$'s are eigenfunctions of $R_X$, then $E[C_m C_n^*] = \langle R_X\phi_n, \phi_m\rangle = \langle\lambda_n\phi_n, \phi_m\rangle = \lambda_n\langle\phi_n, \phi_m\rangle =
\lambda_n I_{\{m=n\}}$. In particular, $E[C_m C_n^*] = 0$ if $n \neq m$, so that (7.42) is a KL expansion. Also, taking
$m = n$ yields $E[|C_n|^2] = \lambda_n$.
Conversely, suppose (7.42) is a KL expansion. Without loss of generality, suppose that the
system $(\phi_n)$ is a basis of $L^2[a,b]$. (If it weren't, it could be extended to a basis by augmenting it
with functions from another basis and applying the Gram-Schmidt method of orthogonalizing.)
Then for $n$ fixed, $\langle R_X\phi_n, \phi_m\rangle = 0$ for all $m \neq n$. By the fact $(\phi_n)$ is a basis, the function $R_X\phi_n$ has
an expansion of the form (7.39), but all terms except possibly the $n$th are zero. Hence, $R_X\phi_n = \lambda_n\phi_n$
for some constant $\lambda_n$, so the eigenrelations (7.44) hold. Again, $E[|C_n|^2] = \lambda_n$ by the computation
above. $\Box$



The following theorem is stated without proof. 

Theorem 7.6.6 (Mercer's theorem) If $R_X$ is the autocorrelation function of a m.s. continuous
random process $X = (X_t : a \leq t \leq b)$ (equivalently, if $R_X$ is a continuous function on $[a, b] \times [a, b]$
that is positive semi-definite, i.e. $(R_X(t_i,t_j))$ is a positive semidefinite matrix for any $n$ and any
$a \leq t_1 < t_2 < \cdots < t_n \leq b$), then there exists an orthonormal basis for $L^2[a,b]$, $(\phi_n : n \geq 1)$, of
continuous eigenfunctions and corresponding nonnegative eigenvalues $(\lambda_n : n \geq 1)$ for $R_X$, and $R_X$
is given by the following series expansion:

$$R_X(s,t) = \sum_{n=1}^{\infty} \lambda_n\phi_n(s)\phi_n^*(t). \tag{7.46}$$

The series converges uniformly in $s, t$, meaning that

$$\lim_{N\to\infty} \max_{s,t\in[a,b]} \left| R_X(s,t) - \sum_{n=1}^{N} \lambda_n\phi_n(s)\phi_n^*(t) \right| = 0.$$



Theorem 7.6.7 (Karhunen-Loeve expansion) If $X = (X_t : a \leq t \leq b)$ is a m.s. continuous
random process it has a KL expansion,

$$X_t = \sum_{n=1}^{\infty} \phi_n(t)\langle X, \phi_n\rangle,$$

and the series converges in the m.s. sense, uniformly over $t \in [a, b]$.



Proof. Use the orthonormal basis $(\phi_n)$ guaranteed by Mercer's theorem. By (7.45), $E[\langle X, \phi_m\rangle\langle X, \phi_n\rangle^*] =
\langle R_X\phi_n, \phi_m\rangle = \lambda_n I_{\{n=m\}}$. Also,

$$E[X_t\langle X, \phi_n\rangle^*] = E\left[X_t\int_a^b X_s^*\phi_n(s)\,ds\right] = \int_a^b R_X(t, s)\phi_n(s)\,ds = \lambda_n\phi_n(t).$$

These facts imply that for finite $N$,

$$E\left[\left| X_t - \sum_{n=1}^{N} \phi_n(t)\langle X, \phi_n\rangle \right|^2\right] = R_X(t,t) - \sum_{n=1}^{N} \lambda_n|\phi_n(t)|^2, \tag{7.47}$$

which, since the series on the right side of (7.47) converges uniformly in $t$ as $N \to \infty$, implies the
stated convergence property for the representation of $X$. $\Box$

Remarks (1) The means of the coordinates of $X$ in a KL expansion can be expressed using the
mean function $\mu_X(t) = E[X_t]$ as follows:

$$E[\langle X, \phi_n\rangle] = \int_a^b \mu_X(t)\phi_n^*(t)\,dt = \langle\mu_X, \phi_n\rangle.$$

Thus, the mean of the $n$th coordinate of $X$ is the $n$th coordinate of the mean function of $X$.
(2) Symbolically, mimicking matrix notation, we can write the representation (7.46) of $R_X$ as

$$R_X(s,t) = \left[\phi_1(s)\,|\,\phi_2(s)\,|\,\cdots\right]
\begin{bmatrix} \lambda_1 & & \\ & \lambda_2 & \\ & & \ddots \end{bmatrix}
\begin{bmatrix} \phi_1^*(t) \\ \phi_2^*(t) \\ \vdots \end{bmatrix}.$$

(3) If $f \in L^2[a,b]$ and $f(t)$ represents a voltage or current across a resistor, then the energy
dissipated during the interval $[a, b]$ is, up to a multiplicative constant, given by

$$(\text{Energy of } f) = \|f\|^2 = \int_a^b |f(t)|^2\,dt = \sum_{n=1}^{\infty} |\langle f, \phi_n\rangle|^2.$$

The mean total energy of $(X_t : a \leq t \leq b)$ is thus given by

$$E\left[\int_a^b |X_t|^2\,dt\right] = \int_a^b R_X(t,t)\,dt = \int_a^b \sum_{n=1}^{\infty} \lambda_n|\phi_n(t)|^2\,dt = \sum_{n=1}^{\infty} \lambda_n.$$

(4) If $(X_t : a \leq t \leq b)$ is a real valued mean zero Gaussian process and if the orthonormal basis
functions are real valued, then the coordinates $\langle X, \phi_n\rangle$ are uncorrelated, real valued, jointly Gaussian
random variables, and therefore are independent.



Example 7.6.8 Let $W = (W_t : t \geq 0)$ be a Brownian motion with parameter $\sigma^2$. Let us find the
KL expansion of $W$ over the interval $[0, T]$. Substituting $R_X(s, t) = \sigma^2(s \wedge t)$ into the eigenrelation
(7.44) yields

$$\int_0^t \sigma^2 s\,\phi_n(s)\,ds + \int_t^T \sigma^2 t\,\phi_n(s)\,ds = \lambda_n\phi_n(t). \tag{7.48}$$

Differentiating (7.48) with respect to $t$ yields

$$\sigma^2 t\phi_n(t) - \sigma^2 t\phi_n(t) + \int_t^T \sigma^2\phi_n(s)\,ds = \lambda_n\phi_n'(t), \tag{7.49}$$

and differentiating a second time yields that the eigenfunctions satisfy the differential equation
$\lambda\phi'' = -\sigma^2\phi$. Also, setting $t = 0$ in (7.48) yields the boundary condition $\phi_n(0) = 0$, and setting
$t = T$ in (7.49) yields the boundary condition $\phi_n'(T) = 0$. Solving yields that the eigenvalue and
eigenfunction pairs for $W$ are

$$\lambda_n = \frac{4\sigma^2T^2}{(2n+1)^2\pi^2}, \qquad \phi_n(t) = \sqrt{\frac{2}{T}}\sin\left(\frac{(2n+1)\pi t}{2T}\right), \qquad n \geq 0.$$

It can be shown that these functions form an orthonormal basis for $L^2[0, T]$.
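
The closed-form eigenvalues can be compared with a numerical approximation. The sketch below is not the derivation used in the example; it discretizes the integral operator with kernel $\sigma^2(s \wedge t)$ on a midpoint grid (a Nystrom-type approximation, an assumption of this illustration) and computes matrix eigenvalues. The values of $\sigma^2$, $T$, and the grid size are arbitrary.

```python
import numpy as np

sigma2, T, M = 1.0, 1.0, 400
h = T / M
t = (np.arange(M) + 0.5) * h                      # midpoint grid on [0, T]

# Kernel matrix R_X(t_i, t_j) = sigma^2 min(t_i, t_j); scaling by h approximates
# the integral operator (R_X phi)(s) = int_0^T R_X(s, t) phi(t) dt.
K = sigma2 * np.minimum.outer(t, t)
eigvals = np.linalg.eigvalsh(K * h)[::-1]         # eigenvalues in descending order

n = np.arange(5)
exact = 4 * sigma2 * T**2 / ((2 * n + 1) ** 2 * np.pi**2)   # formula from the example

total_numeric = eigvals.sum()                     # approximates sum of all lambda_n
total_exact = sigma2 * T**2 / 2                   # int_0^T R_X(t, t) dt = sigma^2 T^2 / 2
```

The leading numerical eigenvalues should be close to the exact ones, and the sum of all eigenvalues matches $\int_0^T R_X(t,t)\,dt = \sigma^2T^2/2$, in agreement with the mean total energy identity $E\left[\int_a^b |X_t|^2 dt\right] = \sum_n \lambda_n$.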



Example 7.6.9 Let $X$ be a white noise process. Such a process is not a random process as defined
in these notes, but can be defined as a generalized process in the same way that a delta function can
be defined as a generalized function. Generalized random processes, just like generalized functions,
only make sense when multiplied by a suitable function and then integrated. For example, the
delta function $\delta$ is defined by the requirement that for any function $f$ that is continuous at $t = 0$,

$$\int_{-\infty}^{\infty} f(t)\delta(t)\,dt = f(0).$$

A white noise process $X$ is such that integrals of the form $\int_{-\infty}^{\infty} f(t)X(t)\,dt$ exist for functions $f$ with
finite $L^2$ norm $\|f\|$. The integrals are random variables with finite second moments, mean zero and
correlations given by

$$E\left[\left(\int_{-\infty}^{\infty} f(s)X_s\,ds\right)\left(\int_{-\infty}^{\infty} g(t)X_t\,dt\right)^*\right] = \sigma^2\int_{-\infty}^{\infty} f(t)g^*(t)\,dt.$$

In a formal or symbolic sense, this means that $X$ is a WSS process with mean $\mu_X = 0$ and
autocorrelation function $R_X(s,t) = E[X_sX_t^*]$ given by $R_X(\tau) = \sigma^2\delta(\tau)$.

What would the KL expansion be for a white noise process over some fixed interval $[a,b]$?
The eigenrelation (7.44) becomes simply $\sigma^2\phi(t) = \lambda_n\phi(t)$ for all $t$ in the interval. Thus, all the
eigenvalues of a white noise process are equal to $\sigma^2$, and any function $\phi$ with finite norm is an
eigenfunction. Thus, if $(\phi_n : n \geq 1)$ is an arbitrary orthonormal basis for $L^2[a,b]$, then the
coordinates of the white noise process $X$, formally given by $X_n = \langle X, \phi_n\rangle$, satisfy

$$E[X_nX_m^*] = \sigma^2 I_{\{n=m\}}. \tag{7.50}$$

This offers a reasonable interpretation of white noise. It is a generalized random process such that
its coordinates $(X_n : n \geq 1)$ relative to an arbitrary orthonormal basis for a finite interval have
mean zero and satisfy (7.50).

7.7 Periodic WSS random processes

Let $X = (X_t : t \in \mathbb{R})$ be a WSS random process and let $T$ be a positive constant.

Proposition 7.7.1 The following three conditions are equivalent:

(a) $R_X(T) = R_X(0)$

(b) $P[X_{T+\tau} = X_\tau] = 1$ for all $\tau \in \mathbb{R}$

(c) $R_X(T + \tau) = R_X(\tau)$ for all $\tau \in \mathbb{R}$ (i.e. $R_X(\tau)$ is periodic with period $T$.)

Proof. Suppose (a) is true. Since $R_X(0)$ is real valued, so is $R_X(T)$, yielding

$$\begin{aligned}
E[|X_{T+\tau} - X_\tau|^2] &= E[X_{T+\tau}X_{T+\tau}^* - X_{T+\tau}X_\tau^* - X_\tau X_{T+\tau}^* + X_\tau X_\tau^*] \\
&= R_X(0) - R_X(T) - R_X^*(T) + R_X(0) = 0.
\end{aligned}$$

Therefore, (a) implies (b). Next, suppose (b) is true and let $\tau \in \mathbb{R}$. Since two random variables
that are equal with probability one have the same expectation, (b) implies that

$$R_X(T + \tau) = E[X_{T+\tau}X_0^*] = E[X_\tau X_0^*] = R_X(\tau).$$

Therefore (b) implies (c). Trivially (c) implies (a), so the equivalence of (a) through (c) is proved. $\Box$



Definition 7.7.2 We call $X$ a periodic, WSS process of period $T$ if $X$ is WSS and any of the three
equivalent properties (a), (b), or (c) of Proposition 7.7.1 hold.

Property (b) almost implies that the sample paths of $X$ are periodic. However, for each $\tau$ it can
be that $X_\tau \neq X_{\tau+T}$ on an event of probability zero, and since there are uncountably many real
numbers $\tau$, the sample paths need not be periodic. However, suppose (b) is true and define a
process $Y$ by $Y_t = X_{(t \bmod T)}$. (Recall that by definition, $(t \bmod T)$ is equal to $t + nT$, where $n$
is selected so that $0 \leq t + nT < T$.) Then $Y$ has periodic sample paths, and $Y$ is a version of $X$,
which by definition means that $P\{X_t = Y_t\} = 1$ for any $t \in \mathbb{R}$. Thus, the properties (a) through
(c) are equivalent to the condition that $X$ is WSS and there is a version of $X$ with periodic sample
paths of period $T$.

Suppose $X$ is a m.s. continuous, periodic, WSS random process. Due to the periodicity of $X$,
it is natural to consider the restriction of $X$ to the interval $[0,T]$. The Karhunen-Loeve expansion
of $X$ restricted to $[0,T]$ is described next. Let $\phi_n$ be the function on $[0,T]$ defined by

$$\phi_n(t) = \frac{e^{2\pi jnt/T}}{\sqrt{T}}.$$

The functions $(\phi_n : n \in \mathbb{Z})$ form an orthonormal basis for $L^2[0, T]$.$^4$ In addition, for any $n$ fixed,
both $R_X(\tau)$ and $\phi_n$ are periodic with period dividing $T$, so

$$\begin{aligned}
\int_0^T R_X(s,t)\phi_n(t)\,dt &= \int_0^T R_X(s - t)\phi_n(t)\,dt \\
&= \int_{s-T}^{s} R_X(t)\phi_n(s - t)\,dt \\
&= \int_0^T R_X(t)\phi_n(s - t)\,dt \\
&= \frac{1}{\sqrt{T}}\int_0^T R_X(t)e^{2\pi jns/T}e^{-2\pi jnt/T}\,dt \\
&= \lambda_n\phi_n(s),
\end{aligned}$$

where $\lambda_n$ is given by

$$\lambda_n = \int_0^T R_X(t)e^{-2\pi jnt/T}\,dt = \sqrt{T}\langle R_X, \phi_n\rangle. \tag{7.51}$$

Therefore $\phi_n$ is an eigenfunction of $R_X$ with eigenvalue $\lambda_n$. The Karhunen-Loeve expansion
of $X$ over the interval $[0, T]$ (Theorem 7.6.7) can be written as

$$X_t = \sum_{n=-\infty}^{\infty} \hat{X}_n e^{2\pi jnt/T} \tag{7.52}$$

where $\hat{X}_n$ is defined by

$$\hat{X}_n = \frac{1}{\sqrt{T}}\langle X, \phi_n\rangle = \frac{1}{T}\int_0^T X_te^{-2\pi jnt/T}\,dt.$$

Note that

$$E[\hat{X}_m\hat{X}_n^*] = \frac{1}{T}E[\langle X, \phi_m\rangle\langle X, \phi_n\rangle^*] = \frac{\lambda_n}{T}I_{\{m=n\}}.$$

Although the representation (7.52) has been derived only for $0 \leq t \leq T$, both sides of (7.52) are
periodic with period $T$. Therefore, the representation (7.52) holds for all $t$. It is called the spectral
representation of the periodic, WSS process $X$.

$^4$ Here it is more convenient to index the functions by the integers, rather than by the nonnegative integers. Sums
of the form $\sum_{n=-\infty}^{\infty}$ should be interpreted as limits of $\sum_{n=-N}^{N}$ as $N \to \infty$.

By (7.51), the series expansion (7.39) applied to the function $R_X$ over the interval $[0, T]$ can be
written as

$$R_X(t) = \sum_{n=-\infty}^{\infty} \frac{\lambda_n}{T}e^{2\pi jnt/T} = \sum_{\omega} p_X(\omega)e^{j\omega t}, \tag{7.53}$$

where $p_X$ is the function on the real line $\mathbb{R} = (\omega : -\infty < \omega < \infty)$,$^5$ defined by

$$p_X(\omega) = \begin{cases} \lambda_n/T & \omega = \frac{2\pi n}{T} \text{ for some integer } n \\ 0 & \text{else} \end{cases}$$

and the sum in (7.53) is only over $\omega$ such that $p_X(\omega) \neq 0$. The function $p_X$ is called the power
spectral mass function of $X$. It is similar to a probability mass function, in that it is positive for
at most a countable infinity of values. The value $p_X\left(\frac{2\pi n}{T}\right)$ is equal to the power of the $n$th term in
the representation (7.52):

$$E[|\hat{X}_ne^{2\pi jnt/T}|^2] = E[|\hat{X}_n|^2] = p_X\left(\frac{2\pi n}{T}\right)$$

and the total mass of $p_X$ is the total power of $X$, $R_X(0) = E[|X_t|^2]$.
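
As a small numerical illustration, the sketch below starts from a hypothetical power spectral mass function supported on five frequencies, builds the corresponding periodic autocorrelation function, and then recovers the masses through the integral formula for $\lambda_n$. The mass values, the period $T$, and the grid size are assumptions made only for this example.

```python
import numpy as np

T = 2.0
n_vals = np.arange(-2, 3)                         # frequencies omega = 2 pi n / T
p = np.array([0.1, 0.5, 1.0, 0.3, 0.2])           # hypothetical masses p_X(2 pi n / T)

def R(tau):
    """R_X(tau) = sum_n p_n exp(2 pi j n tau / T), a periodic autocorrelation."""
    tau = np.atleast_1d(tau)
    return (p * np.exp(2j * np.pi * np.outer(tau, n_vals) / T)).sum(axis=1)

# lambda_n = int_0^T R_X(t) exp(-2 pi j n t / T) dt, via a Riemann sum that is
# exact for trigonometric polynomials of this bandwidth.
M = 64
t = np.arange(M) * T / M
lam = np.array([np.sum(R(t) * np.exp(-2j * np.pi * n * t / T)) * (T / M) for n in n_vals])

recovered_p = lam / T                             # p_X(2 pi n / T) = lambda_n / T
total_power = R(0.0)[0].real                      # R_X(0) = total mass of p_X
```

The masses need not be symmetric in $n$ here because the process is allowed to be complex valued; for a real valued process $R_X$ is real and even, which forces $p_X(\omega) = p_X(-\omega)$.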

Periodicity is a rather restrictive assumption to place on a WSS process. In the next chapter we
shall further investigate spectral properties of WSS processes. We shall see that many WSS random
processes have a power spectral density. A given random variable might have a pmf or a pdf, and
it definitely has a CDF. In the same way, a given WSS process might have a power spectral mass
function or a power spectral density function, and it definitely has a cumulative power spectral
distribution function. The periodic WSS processes of period $T$ are precisely those WSS processes
that have a power spectral mass function that is concentrated on the integer multiples of $\frac{2\pi}{T}$.

7.8 Periodic WSS random processes 

Let X = {Xt : t G R) be a WSS random process and let T be a positive constant. 

Proposition 7.8.1 The following three conditions are equivalent: 

(a) R X (T) = R x {0) 

(b) P[X T+r = X T } = 1 for all r G R 

(c) Rx(T + r) = Rxij) for all r G R (i.e. Rxij) is periodic with period T.) 

Proof. Suppose (a) is true. Since Rx(0) is real valued, so is Rx(T), yielding 

E[\Xt+ t — X T \ ] = E[Xt+ t X t+t — Xt+ t X* — X T X^ +T + X T X*] 
= R x (0) - R X (T) - R* X (T) + R x (0) = 

Therefore, (a) implies (b). Next, suppose (b) is true and let r G R. Since two random variables 
that are equal with probability one have the same expectation, (b) implies that 

Rx(T + t) = E[X t+t X*} = E[X T X*} = R x {r). 

Therefore (b) imples (c). Trivially (c) implies (a), so the equivalence of (a) through (c) is proved. 



The Greek letter u) is used here as it is traditionally used for frequency measured in radians per second. It is 
related to the frequency / measured in cycles per second by to — 2irf. Here ui is not the same as a typical element 
of the underlying space of all outcomes, fl. The meaning of u> should be clear from the context. 
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Definition 7.8.2 We call X aperiodic, WSS process of period T if X is WSS and any of the three 
equivalent properties (a), (b), or (c) of Proposition 7.8.1 hold. 

Property (b) almost implies that the sample paths of X are periodic. However, for each r it can 
be that X T / X t+ t on an event of probability zero, and since there are uncountably many real 
numbers r, the sample paths need not be periodic. However, suppose (b) is true and define a 
process Y by Yt = X, t moc [ T y (Recall that by definition, (i mod T) is equal to t + nT, where n 
is selected so that < t + nT < T.) Then Y has periodic sample paths, and Y is a version of X, 
which by definition means that P{Xt = Yt} = 1 for any i£l. Thus, the properties (a) through 
(c) are equivalent to the condition that X is WSS and there is a version of X with periodic sample 
paths of period T. 

Suppose X is a m.s. continuous, periodic, WSS random process. Due to the periodicity of X, 
it is natural to consider the restriction of X to the interval [0,T]. The Karhunen-Loeve expansion 
of X restricted to [0,T] is described next. Let <f> n be the function on [0,T] defined by 

e 2TTJnt/T 
<t>n{t) = ■ 



T 

The functions (cf) n : n £ Z) form an orthonormal basis for L 2 [0,T]. 6 In addition, for any n fixed, 
both Rx{t) and cf> n are periodic with period dividing T, so 

T pT 

Rx(s,t)<f> n (t)dt = / R x (s - t)4> n {t)dt 
o Jo 

Rx(t)<f>n(s-t)dt 

s-T 
T 

Rx(t)4> n (s - t)dt 



T 



^= [ R x {t)e 2 ^ ns / T e- 2 ^ nt ' T dt 



where A n is given by 

A 



„ = [ R x {t)e- 2 ^ nt / T dt = Vf(RxAn). (7.54) 

Therefore <p n is an eigenfunction of Rx with eigenvalue A n . The Karhunen-Loeve expansion (5.20) 
of X over the interval [0, T] can be written as 

oo 

X t = J2 X ne 2njnt/T (7.55) 



Here it is more convenient to index the functions by the integers, rather than by the nonnegative integers. Sums 
of the form 5^ST=-oo snom d be interpreted as limits of Y2„ = _ N as N — » oo. 
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where X n is defined by 








X n — - 


±=(X,<p n ) = U T X t e-^/ T dt 
vi 1 Jo 


Note that 








E[X m X£\ 


= ±E[{X,<f> m )(X,cf )n )*} = ^/ {m= 



Although the representation (7.55) has been derived only for < t < T, both sides of (7.55) are 
periodic with period T. Therefore, the representation (7.55) holds for all t. It is called the spectral 
representation of the periodic, WSS process X. 

By (7.54), the series expansion (7.39) applied to the function Rx over the interval [0, T] can be 
written as 



-e 
T 

n=—oo 



^n r 2njnt/T 

T 



R x (t) = £ 

n=—(x 

J2px(oj)e ju,t , (7.56) 



where px is the function on the real line R = (a; : — oo < to < oo), 7 defined by 

/ \ f A n /T w = ^3? for some integer n 
PxM = { else 

and the sum in (7.56) is only over u such that px(w) / 0. The function px is called the power 
spectral mass function of X. It is similar to a probability mass function, in that it is positive for 
at most a countable infinity of values. The value px(-f?) is equal to the power of the n term in 
the representation (7.55): 

E[\X n e^/ T f] = E[\X n f]=px(^) 

and the total mass of px is the total power of X, Rx(0) = E[\X t \ 2 ]. 

Periodicity is a rather restrictive assumption to place on a WSS process. In the next chapter we 
shall further investigate spectral properties of WSS processes. We shall see that many WSS random 
processes have a power spectral density. A given random variable might have a pmf or a pdf, and 
it definitely has a CDF. In the same way, a given WSS process might have a power spectral mass 
function or a power spectral density function, and it definitely has a cumulative power spectral 
distribution function. The periodic WSS processes of period T are precisely those WSS processes 
that have a power spectral mass function that is concentrated on the integer multiples of $\frac{2\pi}{T}$.



The Greek letter $\omega$ is used here as it is traditionally used for frequency measured in radians per second. It is
related to the frequency $f$ measured in cycles per second by $\omega = 2\pi f$. Here $\omega$ is not the same as a typical element
of the underlying space of all outcomes, $\Omega$. The meaning of $\omega$ should be clear from the context.
7.9 Problems

7.1 Calculus for a simple Gaussian random process 

Define $X = (X_t : t \in \mathbb{R})$ by $X_t = A + Bt + Ct^2$, where $A$, $B$, and $C$ are independent, $N(0, 1)$
random variables. (a) Verify directly that $X$ is m.s. differentiable. (b) Express $P\{\int_0^1 X_s\,ds \ge 1\}$
in terms of $Q$, the standard normal complementary CDF.

7.2 Lack of sample path continuity of a Poisson process

Let $N = (N_t : t \ge 0)$ be a Poisson process with rate $\lambda > 0$. (a) Find the following two probabilities,
explaining your reasoning: $P\{N \text{ is continuous over the interval } [0,T]\}$ for a fixed $T > 0$, and
$P\{N \text{ is continuous over the interval } [0, \infty)\}$. (b) Is $N$ sample path continuous a.s.? Is $N$ m.s.
continuous?

7.3 Properties of a binary valued process

Let $Y = (Y_t : t \ge 0)$ be given by $Y_t = (-1)^{N_t}$, where $N$ is a Poisson process with rate $\lambda > 0$.
(a) Is $Y$ a Markov process? If so, find the transition probability function $p_{i,j}(s, t)$ and the transition
rate matrix $Q$. (b) Is $Y$ mean square continuous? (c) Is $Y$ mean square differentiable? (d) Does
$\lim_{T\to\infty} \frac{1}{T}\int_0^T Y_t\,dt$ exist in the m.s. sense? If so, identify the limit.

7.4 Some statements related to the basic calculus of random processes 

Classify each of the following statements as either true (meaning always holds) or false, and justify 
your answers. 

(a) Let $X_t = Z$, where $Z$ is a Gaussian random variable. Then $X = (X_t : t \in \mathbb{R})$ is mean ergodic
in the m.s. sense.

(b) The function Rx defined by Rx(t) = < i ^ s a van d autocorrelation function. 

(c) Suppose $X = (X_t : t \in \mathbb{R})$ is a mean zero stationary Gaussian random process, and suppose $X$
is m.s. differentiable. Then for any fixed time $t$, $X_t$ and $X_t'$ are independent.

7.5 Differentiation of the square of a Gaussian random process 

(a) Show that if random variables $(A_n : n \ge 0)$ are mean zero and jointly Gaussian and if
$\lim_{n\to\infty} A_n = A$ m.s., then $\lim_{n\to\infty} A_n^2 = A^2$ m.s. (Hint: If $A$, $B$, $C$, and $D$ are mean zero and
jointly Gaussian, then $E[ABCD] = E[AB]E[CD] + E[AC]E[BD] + E[AD]E[BC]$.)

(b) Show that if random variables $(A_n, B_n : n \ge 0)$ are jointly Gaussian and $\lim_{n\to\infty} A_n = A$ m.s.
and $\lim_{n\to\infty} B_n = B$ m.s. then $\lim_{n\to\infty} A_nB_n = AB$ m.s. (Hint: Use part (a) and the identity
$ab = \frac{(a+b)^2 - a^2 - b^2}{2}$.)

(c) Let $X$ be a mean zero, m.s. differentiable Gaussian random process, and let $Y_t = X_t^2$ for all $t$.
Is $Y$ m.s. differentiable? If so, justify your answer and express the derivative in terms of $X_t$ and
$X_t'$.

7.6 Cross correlation between a process and its m.s. derivative 

Suppose $X$ is a m.s. differentiable random process. Show that $R_{X'X} = \partial_1 R_X$. (It follows, in
particular, that $\partial_1 R_X$ exists.)
7.7 Fundamental theorem of calculus for m.s. calculus

Suppose $X = (X_t : t \ge 0)$ is a m.s. continuous random process. Let $Y$ be the process defined by
$Y_t = \int_0^t X_u\,du$ for $t \ge 0$. Show that $X$ is the m.s. derivative of $Y$. (It follows, in particular, that $Y$
is m.s. differentiable.)

7.8 A windowed Poisson process 

Let $N = (N_t : t \ge 0)$ be a Poisson process with rate $\lambda > 0$, and let $X = (X_t : t \ge 0)$ be defined by
$X_t = N_{t+1} - N_t$. Thus, $X_t$ is the number of counts of $N$ during the time window $(t, t + 1]$.

(a) Sketch a typical sample path of $N$, and the corresponding sample path of $X$.

(b) Find the mean function $\mu_X(t)$ and covariance function $C_X(s,t)$ for $s, t \ge 0$. Express your
answer in a simple form.

(c) Is $X$ Markov? Why or why not?

(d) Is $X$ mean-square continuous? Why or why not?

(e) Determine whether $\frac{1}{t}\int_0^t X_s\,ds$ converges in the mean square sense as $t \to \infty$.

7.9 An integral of white noise times an exponential 

Let $X_t = \int_0^t Z_u e^{-u}\,du$, for $t \ge 0$, where $Z$ is white Gaussian noise with autocorrelation function
$\delta(\tau)\sigma^2$, for some $\sigma^2 > 0$. (a) Find the autocorrelation function, $R_X(s,t)$ for $s, t \ge 0$. (b) Is $X$
mean square differentiable? Justify your answer. (c) Does $X_t$ converge in the mean square sense
as $t \to \infty$? Justify your answer.

7.10 A singular integral with a Brownian motion 

Consider the integral $\int_0^1 \frac{w_t}{t}\,dt$, where $w = (w_t : t \ge 0)$ is a standard Brownian motion. Since
$\mathrm{Var}(\frac{w_t}{t}) = \frac{1}{t}$ diverges as $t \to 0$, we define the integral as $\lim_{\epsilon\to 0}\int_\epsilon^1 \frac{w_t}{t}\,dt$ m.s. if the limit exists.

(a) Does the limit exist? If so, what is the probability distribution of the limit?

(b) Similarly, we define $\int_1^\infty \frac{w_t}{t}\,dt$ to be $\lim_{T\to\infty}\int_1^T \frac{w_t}{t}\,dt$ m.s. if the limit exists. Does the limit
exist? If so, what is the probability distribution of the limit?

7.11 An integrated Poisson process 

Let $N = (N_t : t \ge 0)$ denote a Poisson process with rate $\lambda > 0$, and let $Y_t = \int_0^t N_s\,ds$ for $t \ge 0$. (a)
Sketch a typical sample path of $Y$. (b) Compute the mean function, $\mu_Y(t)$, for $t \ge 0$. (c) Compute
$\mathrm{Var}(Y_t)$ for $t \ge 0$. (d) Determine the value of the limit, $\lim_{t\to\infty} P[Y_t < t]$.

7.12 Recognizing m.s. properties 

Suppose $X$ is a mean zero random process. For each choice of autocorrelation function shown, indicate
which of the following properties $X$ has: m.s. continuous, m.s. differentiable, m.s. integrable
over finite length intervals, and mean ergodic in the m.s. sense.

(a) $X$ is WSS with $R_X(\tau) = (1 - |\tau|)_+$,

(b) $X$ is WSS with $R_X(\tau) = 1 + (1 - |\tau|)_+$,

(c) $X$ is WSS with $R_X(\tau) = \cos(20\pi\tau)\exp(-10|\tau|)$,

(d) Rx(s, t) = < , (not WSS, you don't need to check for mean ergodic property) 

I \J clot? 

(e) $R_X(s,t) = \sqrt{s \wedge t}$ for $s, t \ge 0$. (not WSS, you don't need to check for mean ergodic property)




7.13 A random Taylor's approximation 

Suppose $X$ is a mean zero WSS random process such that $R_X$ is twice continuously differentiable.
Guided by Taylor's approximation for deterministic functions, we might propose the following
estimator of $X_t$ given $X_0$ and $X_0'$: $\widehat{X}_t = X_0 + tX_0'$.

(a) Express the covariance matrix for the vector $(X_0, X_0', X_t)^T$ in terms of the function $R_X$ and its
derivatives.

(b) Express the mean square error $E[(X_t - \widehat{X}_t)^2]$ in terms of the function $R_X$ and its derivatives.

(c) Express the optimal linear estimator $\widehat{E}[X_t \mid X_0, X_0']$ in terms of $X_0$, $X_0'$, and the function $R_X$ and
its derivatives.

(d) (This part is optional - not required.) Compute and compare $\lim_{t\to 0}$ (mean square error)$/t^4$
for the two estimators, under the assumption that $R_X$ is four times continuously differentiable.

7.14 A stationary Gaussian process 

Let X = (Xt : f 6 Z) be a real stationary Gaussian process with mean zero and Rx(t) = ttts- 
Answer the following unrelated questions. 

(a) Is $X$ a Markov process? Justify your answer.

(b) Find $E[X_3 \mid X_0]$ and express $P\{|X_3 - E[X_3 \mid X_0]| \ge 10\}$ in terms of $Q$, the standard Gaussian
complementary cumulative distribution function.

(c) Find the autocorrelation function of $X'$, the m.s. derivative of $X$.

(d) Describe the joint probability density of $(X_0, X_0', X_1)^T$. You need not write it down in detail.

7.15 Integral of a Brownian bridge 

A standard Brownian bridge $B$ can be defined by $B_t = W_t - tW_1$ for $0 \le t \le 1$, where $W$ is a
Brownian motion with parameter $\sigma^2 = 1$. A Brownian bridge is a mean zero, Gaussian random
process which is a.s. sample path continuous, and has autocorrelation function $R_B(s,t) = s(1 - t)$
for $0 \le s \le t \le 1$.

(a) Why is the integral $X = \int_0^1 B_t\,dt$ well defined in the m.s. sense?

(b) Describe the joint distribution of the random variables $X$ and $W_1$.

7.16 Correlation ergodicity of Gaussian processes 

A WSS random process $X$ is called correlation ergodic (in the m.s. sense) if for any constant $h$,

$$\lim_{t\to\infty}\ \text{m.s.}\ \frac{1}{t}\int_0^t X_{s+h}X_s\,ds = E[X_{s+h}X_s].$$

Suppose $X$ is a mean zero, real-valued Gaussian process such that $R_X(\tau) \to 0$ as $|\tau| \to \infty$. Show
that $X$ is correlation ergodic. (Hints: Let $Y_t = X_{t+h}X_t$. Then correlation ergodicity of $X$ is
equivalent to mean ergodicity of $Y$. If $A$, $B$, $C$, and $D$ are mean zero, jointly Gaussian random
variables, then $E[ABCD] = E[AB]E[CD] + E[AC]E[BD] + E[AD]E[BC]$.)

7.17 A random process which changes at a random time 

Let $Y = (Y_t : t \in \mathbb{R})$ and $Z = (Z_t : t \in \mathbb{R})$ be stationary Gaussian Markov processes with mean zero
and autocorrelation functions $R_Y(\tau) = R_Z(\tau) = e^{-|\tau|}$. Let $U$ be a real-valued random variable and
suppose $Y$, $Z$, and $U$ are mutually independent. Finally, let $X = (X_t : t \in \mathbb{R})$ be defined by

$$X_t = \begin{cases} Y_t & t < U \\ Z_t & t \ge U \end{cases}$$

(a) Sketch a typical sample path of $X$.

(b) Find the first order distributions of $X$.

(c) Express the mean and autocorrelation function of $X$ in terms of the CDF, $F_U$, of $U$.

(d) Under what condition on $F_U$ is $X$ m.s. continuous?

(e) Under what condition on $F_U$ is $X$ a Gaussian random process?

7.18 Gaussian review question 

Let $X = (X_t : t \in \mathbb{R})$ be a real-valued stationary Gauss-Markov process with mean zero and
autocorrelation function $C_X(\tau) = 9\exp(-|\tau|)$.

(a) A fourth degree polynomial of two variables is given by $p(x, y) = a + bx + cy + dxy + ex^2y + fxy^2 + \cdots$
such that all terms have the form $cx^iy^j$ with $i + j \le 4$. Suppose $X_2$ is to be estimated by an
estimator of the form $p(X_0, X_1)$. Find the fourth degree polynomial $p$ to minimize the MSE:
$E[(X_2 - p(X_0, X_1))^2]$ and find the resulting MMSE. (Hint: Think! Very little computation is
needed.)

(b) Find P[X 2 > 4|Xo = -,X\ = 3]. You can express your answer using the Gaussian Q function 
$Q(c) = \int_c^\infty \frac{1}{\sqrt{2\pi}}e^{-u^2/2}\,du$. (Hint: Think! Very little computation is needed.)



7.19 First order differential equation driven by Gaussian white noise 

Let $X$ be the solution of the ordinary differential equation $X' = -X + N$, with initial condition $x_0$,
where $N = (N_t : t \ge 0)$ is a real valued Gaussian white noise with $R_N(\tau) = \sigma^2\delta(\tau)$ for some
constant $\sigma^2 > 0$. Although $N$ is not an ordinary random process, we can interpret this as the condition
that $N$ is a Gaussian random process with mean $\mu_N = 0$ and correlation function $R_N(\tau) = \sigma^2\delta(\tau)$.

(a) Find the mean function $\mu_X(t)$ and covariance function $C_X(s,t)$.

(b) Verify that $X$ is a Markov process by checking the necessary and sufficient condition:
$C_X(r, s)C_X(s, t) = C_X(r,t)C_X(s,s)$ whenever $r < s < t$. (Note: The very definition of $X$ also suggests that $X$ is a
Markov process, because if $t$ is the "present time," the future of $X$ depends only on $X_t$ and the
future of the white noise. The future of the white noise is independent of the past $(X_s : s \le t)$.
Thus, the present value $X_t$ contains all the information from the past of $X$ that is relevant to the
future of $X$. This is the continuous-time analog of the discrete-time Kalman state equation.)

(c) Find the limits of $\mu_X(t)$ and $R_X(t + \tau, t)$ as $t \to \infty$. (Because these limits exist, $X$ is said to be
asymptotically WSS.)

7.20 KL expansion of a simple random process 

Let $X$ be a WSS random process with mean zero and autocorrelation function
$R_X(\tau) = 100(\cos(10\pi\tau))^2 = 50 + 50\cos(20\pi\tau)$.

(a) Is $X$ mean square differentiable? (Justify your answer.)

(b) Is $X$ mean ergodic in the m.s. sense? (Justify your answer.)

(c) Describe a set of eigenfunctions and corresponding eigenvalues for the Karhunen-Loeve expansion
of $(X_t : 0 \le t \le 1)$.

7.21 KL expansion of a finite rank process 

Suppose $Z = (Z_t : 0 \le t \le T)$ has the form $Z_t = \sum_{n=1}^{N} X_n\xi_n(t)$ such that the functions $\xi_1, \ldots, \xi_N$
are orthonormal over the interval $[0, T]$, and the vector $X = (X_1, \ldots, X_N)^T$ has a correlation matrix
$K$ with $\det(K) \neq 0$. The process $Z$ is said to have rank $N$. Suppose $K$ is not diagonal. Describe
the Karhunen-Loeve expansion of $Z$. That is, describe an orthonormal basis $(\phi_n : n \ge 1)$, and
eigenvalues for the K-L expansion of $Z$, in terms of the given functions $(\xi_n)$ and correlation matrix
$K$. Also, describe how the coordinates $\langle Z, \phi_n\rangle$ are related to $X$.

7.22 KL expansion for derivative process 

Suppose that $X = (X_t : 0 \le t \le 1)$ is a m.s. continuously differentiable random process on the
interval $[0, 1]$. Differentiating the KL expansion of $X$ yields $X'(t) = \sum_n \langle X, \phi_n\rangle\phi_n'(t)$, which looks
similar to a KL expansion for $X'$, but it may be that the functions $\phi_n'$ are not orthonormal. For some
cases it is not difficult to identify the KL expansion for $X'$. To explore this, let $(\phi_n(t))$, $(\langle X, \phi_n\rangle)$,
and $(\lambda_n)$ denote the eigenfunctions, coordinate random variables, and eigenvalues, for the KL
expansion of $X$ over the interval $[0, 1]$. Let $(\psi_k(t))$, $(\langle X', \psi_k\rangle)$, and $(\mu_k)$ denote the corresponding
quantities for $X'$. For each of the following choices of $(\phi_n(t))$, express the eigenfunctions, coordinate
random variables, and eigenvalues, for $X'$ in terms of those for $X$:

(a) $\phi_n(t) = e^{2\pi jnt}$, $n \in \mathbb{Z}$

(b) $\phi_1(t) = 1$, $\phi_{2k}(t) = \sqrt{2}\cos(2\pi kt)$, and $\phi_{2k+1}(t) = \sqrt{2}\sin(2\pi kt)$ for $k \ge 1$.

(c) $\phi_n(t) = \sqrt{2}\sin\left(\frac{(2n+1)\pi t}{2}\right)$, $n \ge 0$. (Hint: Sketch $\phi_n$ and $\phi_n'$ for $n = 1, 2, 3$.)

(d) $\phi_1(t) = c_1(1 + \sqrt{3}t)$ and $\phi_2(t) = c_2(1 - \sqrt{3}t)$. (Suppose $\lambda_n = 0$ for $n \notin \{1, 2\}$. The constants $c_n$
should be selected so that $\|\phi_n\| = 1$ for $n = 1, 2$, but there is no need to calculate the constants for
this problem.)

7.23 An infinitely differentiable process 

Let $X = (X_t : t \in \mathbb{R})$ be WSS with autocorrelation function $R_X(\tau) = e^{-\tau^2/2}$. (a) Show that $X$ is
$k$-times differentiable in the m.s. sense, for all $k \ge 1$. (b) Let $X^{(k)}$ denote the $k$th derivative process
of $X$, for $k \ge 1$. Is $X^{(k)}$ mean ergodic in the m.s. sense for each $k$? Justify your answer.

7.24 Mean ergodicity of a periodic WSS random process 

Let $X$ be a mean zero periodic WSS random process with period $T > 0$. Recall that $X$ has a power
spectral representation

$$X_t = \sum_{n\in\mathbb{Z}} \widehat{X}_n e^{2\pi jnt/T},$$

where the coefficients $\widehat{X}_n$ are orthogonal random variables. The power spectral mass function of $X$
is the discrete mass function $p_X$ supported on frequencies of the form $\frac{2\pi n}{T}$, such that $E[|\widehat{X}_n|^2] =
p_X(\frac{2\pi n}{T})$. Under what conditions on $p_X$ is the process $X$ mean ergodic in the m.s. sense? Justify
your answer.
7.25 Application of the KL expansion to estimation

Let $X = (X_t : 0 \le t \le T)$ be a random process given by $X_t = AB\sin(\frac{\pi t}{T})$, where $A$ and $T$ are positive
constants and $B$ is a $N(0, 1)$ random variable. Think of $X$ as an amplitude modulated random
signal.

(a) What is the expected total energy of $X$?

(b) What are the mean and covariance functions of $X$?

(c) Describe the Karhunen-Loeve expansion of $X$. (Hint: Only one eigenvalue is nonzero, call it
$\lambda_1$. What are $\lambda_1$, the corresponding eigenfunction $\phi_1$, and the first coordinate $X_1 = \langle X, \phi_1\rangle$? You
don't need to explicitly identify the other eigenfunctions $\phi_2, \phi_3, \ldots$. They can simply be taken to
fill out an orthonormal basis.)

(d) Let $N = (N_t : 0 \le t \le T)$ be a real-valued Gaussian white noise process independent of $X$ with
$R_N(\tau) = \sigma^2\delta(\tau)$, and let $Y = X + N$. Think of $Y$ as a noisy observation of $X$. The same basis
functions used for $X$ can be used for the Karhunen-Loeve expansions of $N$ and $Y$. Let $N_1 = \langle N, \phi_1\rangle$
and $Y_1 = \langle Y, \phi_1\rangle$. Note that $Y_1 = X_1 + N_1$. Find $E[B \mid Y_1]$ and the resulting mean square error.
(Remark: The other coordinates $Y_2, Y_3, \ldots$ are independent of both $X$ and $Y_1$, and are thus useless
for the purpose of estimating $B$. Thus, $E[B \mid Y_1]$ is equal to $E[B \mid Y]$, the MMSE estimate of $B$ given
the entire observation process $Y$.)

7.26 * An autocorrelation function or not? 

Let $R_X(s,t) = \cosh(a(|s - t| - 0.5))$ for $-0.5 \le s, t \le 0.5$ where $a$ is a positive constant. Is $R_X$
the autocorrelation function of a random process of the form $X = (X_t : -0.5 \le t \le 0.5)$? If not,
explain why not. If so, give the Karhunen-Loeve expansion for $X$.

7.27 * On the conditions for m.s. differentiability 

(a) Let $f(t) = \begin{cases} t^2\sin(1/t^2) & t \neq 0 \\ 0 & t = 0 \end{cases}$. Sketch $f$ and show that $f$ is differentiable over all of $\mathbb{R}$, and
find the derivative function $f'$. Note that $f'$ is not continuous, and $\int_{-1}^{1} f'(t)\,dt$ is not well defined,
whereas this integral would equal $f(1) - f(-1)$ if $f'$ were continuous.

(b) Let $X_t = Af(t)$, where $A$ is a random variable with mean zero and variance one. Show that $X$
is m.s. differentiable.

(c) Find $R_X$. Show that $\partial_1 R_X$ and $\partial_2\partial_1 R_X$ exist but are not continuous.



Chapter 8 

Random Processes in Linear Systems 
and Spectral Analysis 



Random processes can be passed through linear systems in much the same way as deterministic 
signals can. A time-invariant linear system is described in the time domain by an impulse response 
function, and in the frequency domain by the Fourier transform of the impulse response function. 
In a sense we shall see that Fourier transforms provide a diagonalization of WSS random processes, 
just as the Karhunen-Loeve expansion allows for the diagonalization of a random process defined 
on a finite interval. While a m.s. continuous random process on a finite interval has a finite average 
energy, a WSS random process has a finite mean average energy per unit time, called the power. 

Nearly all the definitions and results of this chapter can be carried through in either discrete 
time or continuous time. The set of frequencies relevant for continuous-time random processes is all 
of $\mathbb{R}$, while the set of frequencies relevant for discrete-time random processes is the interval $[-\pi, \pi]$.
For ease of notation we shall primarily concentrate on continuous-time processes and systems in 
the first two sections, and give the corresponding definition for discrete time in the third section. 

Representations of baseband random processes and narrowband random processes are discussed 
in Sections 8.4 and 8.5. Roughly speaking, baseband random processes are those which have power 
only in low frequencies. A baseband random process can be recovered from samples taken at a 
sampling frequency that is at least twice as large as the largest frequency component of the process. 
Thus, operations and statistical calculations for a continuous-time baseband process can be reduced 
to considerations for the discrete time sampled process. Roughly speaking, narrowband random 
processes are those processes which have power only in a band (i.e. interval) of frequencies. A 
narrowband random process can be represented as a baseband random process that is modulated
by a deterministic sinusoid. Complex random processes naturally arise as baseband equivalent 
processes for real-valued narrowband random processes. A related discussion of complex random 
processes is given in the last section of the chapter. 
8.1 Basic definitions

The output $(Y_t : t \in \mathbb{R})$ of a linear system with impulse response function $h(s, t)$ and a random
process input $(X_t : t \in \mathbb{R})$ is defined by

$$Y_s = \int_{-\infty}^{\infty} h(s,t)X_t\,dt. \qquad (8.1)$$

See Figure 8.1.

Figure 8.1: A linear system with input $X$, impulse response function $h$, and output $Y$.

For example, the linear system could be a simple integrator from time zero, defined by

$$Y_s = \begin{cases} \int_0^s X_t\,dt & s \ge 0 \\ 0 & s < 0, \end{cases}$$

in which case the impulse response function is

$$h(s,t) = \begin{cases} 1 & s \ge t \ge 0 \\ 0 & \text{otherwise.} \end{cases}$$



The integral (8.1) defining the output $Y$ will be interpreted in the m.s. sense. Thus, the integral
defining $Y_s$ for $s$ fixed exists if and only if the following Riemann integral exists and is finite:

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h^*(s,\tau)h(s,t)R_X(t,\tau)\,dt\,d\tau. \qquad (8.2)$$

A sufficient condition for $Y_s$ to be well defined is that $R_X$ is a bounded continuous function, and
$h(s,t)$ is continuous in $t$ with $\int_{-\infty}^{\infty} |h(s,t)|\,dt < \infty$. The mean function of the output is given by

$$\mu_Y(s) = E\left[\int_{-\infty}^{\infty} h(s,t)X_t\,dt\right] = \int_{-\infty}^{\infty} h(s,t)\mu_X(t)\,dt. \qquad (8.3)$$



As illustrated in Figure 8.2, the mean function of the output is the result of passing the mean
function of the input through the linear system.

Figure 8.2: A linear system with input $\mu_X$ and impulse response function $h$.

The cross correlation function between the output and input processes is given by

$$R_{YX}(s,\tau) = E\left[\int_{-\infty}^{\infty} h(s,t)X_t\,dt\ X_\tau^*\right] = \int_{-\infty}^{\infty} h(s,t)R_X(t,\tau)\,dt \qquad (8.4)$$

and the correlation function of the output is given by

$$
\begin{aligned}
R_Y(s,u) &= E\left[Y_s\left(\int_{-\infty}^{\infty} h(u,\tau)X_\tau\,d\tau\right)^*\right] \\
&= \int_{-\infty}^{\infty} h^*(u,\tau)R_{YX}(s,\tau)\,d\tau \qquad (8.5) \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h^*(u,\tau)h(s,t)R_X(t,\tau)\,dt\,d\tau. \qquad (8.6)
\end{aligned}
$$

Recall that $Y_s$ is well defined as a m.s. integral if and only if the integral (8.2) is well defined and
finite. Comparing with (8.6), it means that $Y_s$ is well defined if and only if the right side of (8.6)
with $u = s$ is well defined and gives a finite value for $E[|Y_s|^2]$.

The linear system is time invariant if h(s, t) depends on s, t only through s — t. If the system is 
time invariant we write h(s — t) instead of h(s,t), and with this substitution the defining relation 
(8.1) becomes a convolution: Y = h * X. 

A linear system is called bounded input bounded output (bibo) stable if the output is bounded
whenever the input is bounded. In case the system is time invariant, bibo stability is equivalent to
the condition

$$\int_{-\infty}^{\infty} |h(\tau)|\,d\tau < \infty. \qquad (8.7)$$

In particular, if (8.7) holds and if an input signal $x$ satisfies $|x_s| < L$ for all $s$, then the output
signal $y = x * h$ satisfies

$$|y(t)| \le \int_{-\infty}^{\infty} |h(t-s)|L\,ds = L\int_{-\infty}^{\infty} |h(\tau)|\,d\tau$$

for all $t$. If $X$ is a WSS random process then by the Schwarz inequality, $R_X$ is bounded by $R_X(0)$.
Thus, if $X$ is WSS and m.s. continuous, and if the linear system is time-invariant and bibo stable,
the integral in (8.2) exists and is bounded by

$$R_X(0)\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} |h(s-\tau)||h(s-t)|\,dt\,d\tau = R_X(0)\left(\int_{-\infty}^{\infty} |h(\tau)|\,d\tau\right)^2 < \infty.$$

Thus, the output of a linear, time-invariant bibo stable system is well defined in the m.s. sense if
the input is a stationary, m.s. continuous process.
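
The bound following (8.7) can be checked on a discrete-time analogue, in which the integrals become finite sums. The sketch below is only an illustration under that assumption: the impulse response `h`, the bound `L`, the input `x`, and the helper `convolve` are hypothetical choices and do not come from the notes.

```python
def convolve(x, h):
    """Full discrete convolution y[n] = sum_k h[k] * x[n - k] of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for k, hk in enumerate(h):
            y[i + k] += xi * hk
    return y

h = [0.5, -0.3, 0.2, -0.1]                      # hypothetical absolutely summable impulse response
L = 2.0                                         # bound on the input magnitude
x = [L * (-1) ** (n // 3) for n in range(50)]   # input with |x[n]| <= L for all n
y = convolve(x, h)

bound = L * sum(abs(hk) for hk in h)            # discrete counterpart of L * integral of |h|
worst = max(abs(v) for v in y)
print(worst, "<=", bound)
```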

A paragraph about convolutions is in order. It is useful to be able to recognize convolution
integrals in disguise. If $f$ and $g$ are functions on $\mathbb{R}$, the convolution is the function $f * g$ defined by

$$f * g(t) = \int_{-\infty}^{\infty} f(s)g(t-s)\,ds$$

or equivalently

$$f * g(t) = \int_{-\infty}^{\infty} f(t-s)g(s)\,ds$$

or equivalently, for any real $a$ and $b$

$$f * g(a+b) = \int_{-\infty}^{\infty} f(a+s)g(b-s)\,ds.$$

A simple change of variable shows that the above three expressions are equivalent. However, in
order to immediately recognize a convolution, the salient feature is that the convolution is the
integral of the product of $f$ and $g$, with the arguments of both $f$ and $g$ ranging over $\mathbb{R}$ in such a
way that the sum of the two arguments is held constant. The value of the constant is the value at
which the convolution is being evaluated. Convolution is commutative: $f * g = g * f$ and associative:
$(f * g) * k = f * (g * k)$ for three functions $f, g, k$. We simply write $f * g * k$ for $(f * g) * k$. The
convolution $f * g * k$ is equal to a double integral of the product of $f$, $g$, and $k$, with the arguments
of the three functions ranging over all triples in $\mathbb{R}^3$ with a constant sum. The value of the constant
is the value at which the convolution is being evaluated. For example,

$$f * g * k(a+b+c) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(a+s+t)g(b-s)k(c-t)\,ds\,dt.$$
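
The properties just described can be tried out on finitely supported sequences, where the integrals become sums. This is a minimal sketch under that discrete-time assumption; the sequences `f`, `g`, `k`, the offsets `a`, `b`, and the helpers `conv` and `at` are hypothetical names introduced for illustration.

```python
def conv(f, g):
    """Discrete convolution of sequences supported on 0..len-1."""
    out = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj   # the two arguments i and j always sum to the output index
    return out

def at(seq, n):
    """Value of a finitely supported sequence, taken to be zero outside its support."""
    return seq[n] if 0 <= n < len(seq) else 0

f, g, k = [1, 2, 3], [0, 1, -1, 2], [4, 0, -2]
fg = conv(f, g)

# Commutativity and associativity (exact here, since the entries are integers).
commutative = conv(f, g) == conv(g, f)
associative = conv(conv(f, g), k) == conv(f, conv(g, k))

# The "constant sum of arguments" form: (f*g)(a+b) = sum_s f(a+s) g(b-s).
a, b = 2, 3
shifted = sum(at(f, a + s) * at(g, b - s) for s in range(-10, 11))
print(commutative, associative, shifted, at(fg, a + b))
```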

Suppose that $X$ is WSS and that the linear system is time invariant. Then (8.3) becomes

$$\mu_Y(s) = \int_{-\infty}^{\infty} h(s-t)\mu_X\,dt = \mu_X\int_{-\infty}^{\infty} h(t)\,dt.$$

Observe that $\mu_Y(s)$ does not depend on $s$. Equation (8.4) becomes

$$R_{YX}(s,\tau) = \int_{-\infty}^{\infty} h(s-t)R_X(t-\tau)\,dt = h * R_X(s-\tau), \qquad (8.8)$$

which in particular means that $R_{YX}(s,\tau)$ is a function of $s - \tau$ alone. Equation (8.5) becomes

$$R_Y(s,u) = \int_{-\infty}^{\infty} h^*(u-\tau)R_{YX}(s-\tau)\,d\tau. \qquad (8.9)$$

The right side of (8.9) looks nearly like a convolution, but as $\tau$ varies the sum of the two arguments
is $u - \tau + s - \tau$, which is not constant as $\tau$ varies. To arrive at a true convolution, define the new
function $\widetilde{h}$ by $\widetilde{h}(v) = h^*(-v)$. Using the definition of $\widetilde{h}$ and (8.8) in (8.9) yields

$$
\begin{aligned}
R_Y(s,u) &= \int_{-\infty}^{\infty} \widetilde{h}(\tau-u)(h * R_X)(s-\tau)\,d\tau \\
&= \widetilde{h} * (h * R_X)(s-u) = h * \widetilde{h} * R_X(s-u),
\end{aligned}
$$

which in particular means that $R_Y(s, u)$ is a function of $s - u$ alone.
To summarize, if $X$ is WSS and if the linear system is time invariant, then $X$ and $Y$ are jointly
WSS with

$$\mu_Y = \mu_X\int_{-\infty}^{\infty} h(t)\,dt \qquad R_{YX} = h * R_X \qquad R_Y = h * \widetilde{h} * R_X. \qquad (8.10)$$

The convolution $\widetilde{h} * h$, equal to $h * \widetilde{h}$, can also be written as

$$h * \widetilde{h}(t) = \int_{-\infty}^{\infty} h(s)\widetilde{h}(t-s)\,ds = \int_{-\infty}^{\infty} h(s)h^*(s-t)\,ds. \qquad (8.11)$$

The expression shows that $h * \widetilde{h}(t)$ is the correlation between $h$ and $h^*$ translated by $t$ from the
origin.

The equations derived in this section for the correlation functions $R_X$, $R_{YX}$ and $R_Y$ also hold for
the covariance functions $C_X$, $C_{YX}$, and $C_Y$. The derivations are the same except that covariances
rather than correlations are computed. In particular, if $X$ is WSS and the system is linear and
time invariant, then $C_{YX} = h * C_X$ and $C_Y = h * \widetilde{h} * C_X$.
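
The relation $R_Y = h * \widetilde{h} * R_X$ in (8.10) can be compared with the double-integral form (8.6) in a discrete-time setting. The sketch below assumes a real FIR impulse response `h` and an input autocorrelation `R_X(k) = rho**abs(k)`; both are hypothetical choices made only for illustration, and sums replace the integrals of the text.

```python
def R_X(k, rho=0.5):
    """Hypothetical WSS input autocorrelation R_X(k) = rho^|k| (real valued)."""
    return rho ** abs(k)

h = [1.0, 0.5, -0.25]   # hypothetical real impulse response h[0], h[1], h[2]

def h_corr(m):
    """(h * h~)(m) = sum_s h(s) h*(s - m), the discrete form of (8.11)."""
    return sum(h[s] * h[s - m] for s in range(len(h)) if 0 <= s - m < len(h))

def R_Y_direct(k):
    """E[Y_{n+k} Y_n^*] written as a double sum, the discrete form of (8.6)."""
    return sum(h[i] * h[j] * R_X(k - i + j)
               for i in range(len(h)) for j in range(len(h)))

def R_Y_conv(k):
    """(h * h~ * R_X)(k), the discrete form of the last relation in (8.10)."""
    n = len(h) - 1
    return sum(h_corr(m) * R_X(k - m) for m in range(-n, n + 1))

for k in range(-3, 4):
    print(k, R_Y_direct(k), R_Y_conv(k))
```

The two columns agree, and the output autocorrelation is symmetric in the lag because `h` and `R_X` are real.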

8.2 Fourier transforms, transfer functions and power spectral densities

Fourier transforms convert convolutions into products, so this is a good point to begin using Fourier
transforms. The Fourier transform of a function $g$ mapping $\mathbb{R}$ to the complex numbers $\mathbb{C}$ is formally
defined by

$$\widehat{g}(\omega) = \int_{-\infty}^{\infty} e^{-j\omega t}g(t)\,dt. \qquad (8.12)$$



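Some important properties of Fourier transforms are stated next.

Linearity: $\widehat{ag + bh} = a\widehat{g} + b\widehat{h}$

Inversion: $g(t) = \int_{-\infty}^{\infty} e^{j\omega t}\widehat{g}(\omega)\frac{d\omega}{2\pi}$

Convolution to multiplication: $\widehat{g * h} = \widehat{g}\,\widehat{h}$ and $\widehat{g} * \widehat{h} = 2\pi\widehat{gh}$

Parseval's identity: $\int_{-\infty}^{\infty} g(t)h^*(t)\,dt = \int_{-\infty}^{\infty} \widehat{g}(\omega)\widehat{h}^*(\omega)\frac{d\omega}{2\pi}$

Transform of time reversal: $\widehat{\widetilde{h}} = \widehat{h}^*$, where $\widetilde{h}(t) = h^*(-t)$

Differentiation to multiplication by $j\omega$: $\widehat{\frac{dg}{dt}}(\omega) = (j\omega)\widehat{g}(\omega)$

Pure sinusoid to delta function: For $\omega_o$ fixed: $\widehat{e^{j\omega_o t}}(\omega) = 2\pi\delta(\omega - \omega_o)$

Delta function to pure sinusoid: For $t_o$ fixed: $\widehat{\delta(t - t_o)}(\omega) = e^{-j\omega t_o}$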
The inversion formula above shows that a function $g$ can be represented as an integral (basically
a limiting form of linear combination) of sinusoidal functions of time $e^{j\omega t}$, and $\widehat{g}(\omega)$ is the coefficient
in the representation for each $\omega$. Parseval's identity applied with $g = h$ yields that the total
energy of $g$ (the square of the $L^2$ norm) can be computed in either the time or frequency domain:
$\|g\|^2 = \int_{-\infty}^{\infty} |g(t)|^2\,dt = \int_{-\infty}^{\infty} |\widehat{g}(\omega)|^2\frac{d\omega}{2\pi}$. The factor $2\pi$ in the formulas can be attributed to the use
of frequency $\omega$ in radians. If $\omega = 2\pi f$, then $f$ is the frequency in Hertz (Hz) and $\frac{d\omega}{2\pi}$ is simply $df$.
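
A finite, discrete analogue of two of the properties above is easy to check numerically. The sketch below uses the discrete Fourier transform (DFT) on length-$N$ sequences, which is an assumption made for illustration: in that setting the factor $1/N$ plays the role of $\frac{d\omega}{2\pi}$, and convolution is circular. The sequences `g`, `h` and the helper names `dft` and `circular_conv` are hypothetical.

```python
import cmath
import math

def dft(g):
    """Discrete Fourier transform G[w] = sum_t g[t] exp(-2*pi*j*w*t/N)."""
    N = len(g)
    return [sum(g[t] * cmath.exp(-2j * math.pi * w * t / N) for t in range(N))
            for w in range(N)]

def circular_conv(g, h):
    """Circular convolution, the finite-length counterpart of g * h."""
    N = len(g)
    return [sum(g[s] * h[(t - s) % N] for s in range(N)) for t in range(N)]

g = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 0.0, 0.0]
h = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, -0.5]
N = len(g)
G, H = dft(g), dft(h)

# Parseval: energy computed in the time domain and in the frequency domain.
energy_time = sum(abs(v) ** 2 for v in g)
energy_freq = sum(abs(v) ** 2 for v in G) / N

# Convolution to multiplication: transform of g * h equals G times H.
conv_hat = dft(circular_conv(g, h))
max_gap = max(abs(conv_hat[w] - G[w] * H[w]) for w in range(N))
print(energy_time, energy_freq, max_gap)
```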

The Fourier transform can be defined for a very large class of functions, including generalized
functions such as delta functions. In these notes we won't attempt a systematic treatment, but will
use Fourier transforms with impunity. In applications, one is often forced to determine in what
senses the transform is well defined on a case-by-case basis. Two sufficient conditions for the Fourier
transform of $g$ to be well defined are mentioned in the remainder of this paragraph. The relation
(8.12) defining a Fourier transform of $g$ is well defined if, for example, $g$ is a continuous function
which is integrable: $\int_{-\infty}^{\infty} |g(t)|\,dt < \infty$, and in this case the dominated convergence theorem implies
that $\widehat{g}$ is a continuous function. The Fourier transform can also be naturally defined whenever $g$
has a finite $L^2$ norm, through the use of Parseval's identity. The idea is that if $g$ has finite $L^2$ norm,
then it is the limit in the $L^2$ norm of a sequence of functions $g_n$ which are integrable. Owing to
Parseval's identity, the Fourier transforms $\widehat{g}_n$ form a Cauchy sequence in the $L^2$ norm, and hence
have a limit, which is defined to be $\widehat{g}$.

Return now to consideration of a linear time-invariant system with an impulse response function
$h = (h(\tau) : \tau \in \mathbb{R})$. The Fourier transform of $h$ is used so often that a special name and notation
is used: it is called the transfer function and is denoted by $H(\omega)$.

The output signal $y = (y_t : t \in \mathbb{R})$ for an input signal $x = (x_t : t \in \mathbb{R})$ is given in the time
domain by the convolution $y = x * h$. In the frequency domain this becomes $\widehat{y}(\omega) = H(\omega)\widehat{x}(\omega)$.
For example, given $a < b$ let $H_{[a,b]}(\omega)$ be the ideal bandpass transfer function for frequency band
$[a, b]$, defined by

$$H_{[a,b]}(\omega) = \begin{cases} 1 & a \le \omega \le b \\ 0 & \text{otherwise.} \end{cases} \qquad (8.13)$$

If $x$ is the input and $y$ is the output of a linear system with transfer function $H_{[a,b]}$, then the
relation $\widehat{y}(\omega) = H_{[a,b]}(\omega)\widehat{x}(\omega)$ shows that the frequency components of $x$ in the frequency band
$[a, b]$ pass through the filter unchanged, and the frequency components of $x$ outside of the band are
completely nulled. The total energy of the output function $y$ can therefore be interpreted as the
energy of $x$ in the frequency band $[a, b]$. Therefore,

$$\|y\|^2 = \int_{-\infty}^{\infty} |H_{[a,b]}(\omega)|^2|\widehat{x}(\omega)|^2\frac{d\omega}{2\pi} = \int_a^b |\widehat{x}(\omega)|^2\frac{d\omega}{2\pi}.$$

Consequently, it is appropriate to call $|\widehat{x}(\omega)|^2$ the energy spectral density of the deterministic signal
$x$.

Given a WSS random process $X = (X_t : t \in \mathbb{R})$, the Fourier transform of its correlation function
$R_X$ is denoted by $S_X$. For reasons that we will soon see, the function $S_X$ is called the power spectral
density of $X$. Similarly, if $Y$ and $X$ are jointly WSS, then the Fourier transform of $R_{YX}$ is denoted
by $S_{YX}$, called the cross power spectral density function of $Y$ and $X$. The Fourier transform of
the time reverse complex conjugate function $\widetilde{h}$ is equal to $H^*$, so $|H(\omega)|^2$ is the Fourier transform
of $h * \widetilde{h}$. With the above notation, the second moment relationships in (8.10) become:

$$S_{YX}(\omega) = H(\omega)S_X(\omega) \qquad S_Y(\omega) = |H(\omega)|^2S_X(\omega)$$

Let us examine some of the properties of the power spectral density, $S_X$. If $\int_{-\infty}^{\infty} |R_X(t)|\,dt < \infty$
then $S_X$ is well defined and is a continuous function. Because $R_{YX} = \widetilde{R}_{XY}$, it follows that
$S_{YX} = S_{XY}^*$. In particular, taking $Y = X$ yields $R_X = \widetilde{R}_X$ and $S_X = S_X^*$, meaning that $S_X$ is
real-valued.

The Fourier inversion formula applied to $S_X$ yields that $R_X(\tau) = \int_{-\infty}^{\infty} e^{j\omega\tau}S_X(\omega)\frac{d\omega}{2\pi}$. In particular,

$$E[|X_t|^2] = R_X(0) = \int_{-\infty}^{\infty} S_X(\omega)\frac{d\omega}{2\pi}. \qquad (8.14)$$

The expectation $E[|X_t|^2]$ is called the power (or total power) of $X$, because if $X_t$ is a voltage or
current across a resistor, $|X_t|^2$ is the instantaneous rate of dissipation of heat energy. Therefore,
(8.14) means that the total power of $X$ is the integral of $S_X$ over $\mathbb{R}$. This is the first hint that the
name power spectral density for $S_X$ is justified.

Let $a < b$ and let $Y$ denote the output when the WSS process $X$ is passed through the linear
time-invariant system with transfer function $H_{[a,b]}$ defined by (8.13). The process $Y$ represents the
part of $X$ in the frequency band $[a, b]$. By the relation $S_Y = |H_{[a,b]}|^2S_X$ and the power relationship
(8.14) applied to $Y$, we have

$$E[|Y_t|^2] = \int_{-\infty}^{\infty} S_Y(\omega)\frac{d\omega}{2\pi} = \int_a^b S_X(\omega)\frac{d\omega}{2\pi}. \qquad (8.15)$$

Two observations can be made concerning (8.15). First, the integral of $S_X$ over any interval $[a, b]$
is nonnegative. If $S_X$ is continuous, this implies that $S_X$ is nonnegative. Even if $S_X$ is not
continuous, we can conclude that $S_X$ is nonnegative except possibly on a set of zero measure. The
second observation is that (8.15) fully justifies the name "power spectral density of $X$" given to $S_X$.
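
To make (8.14) and (8.15) concrete, the sketch below takes the assumed autocorrelation $R_X(\tau) = e^{-|\tau|}$, whose Fourier transform under the convention (8.12) is $S_X(\omega) = 2/(1+\omega^2)$. It approximates $S_X$ by numerical integration and computes the power in the band $[-1, 1]$. The choice of $R_X$, the truncation `cutoff`, the step counts, and the function names are assumptions for illustration only.

```python
import math

def R_X(tau):
    """Assumed autocorrelation function, R_X(tau) = exp(-|tau|)."""
    return math.exp(-abs(tau))

def S_X_numeric(omega, cutoff=40.0, steps=40000):
    """Midpoint-rule approximation of (8.12) applied to R_X.
    The imaginary part vanishes because R_X is even, so only the cosine term is kept."""
    d = 2 * cutoff / steps
    total = 0.0
    for i in range(steps):
        tau = -cutoff + (i + 0.5) * d
        total += math.cos(omega * tau) * R_X(tau)
    return total * d

def band_power(a, b, steps=2000):
    """Integral of 2/(1 + w^2) over [a, b] with respect to dw/(2*pi), as in (8.15)."""
    d = (b - a) / steps
    total = sum(2.0 / (1.0 + (a + (i + 0.5) * d) ** 2) for i in range(steps))
    return total * d / (2 * math.pi)

print(S_X_numeric(0.0), S_X_numeric(1.0))   # close to 2 and 1
print(band_power(-1.0, 1.0))                # close to 0.5, half of R_X(0) = 1
```

For this example, half of the total power $R_X(0) = 1$ lies in the band $[-1, 1]$, since $\frac{2}{\pi}\arctan(1) = \frac{1}{2}$.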



Example 8.2.1 Suppose $X$ is a WSS process and that $Y$ is a moving average of $X$ with averaging
window duration $T$ for some $T > 0$:

$$Y_t = \frac{1}{T}\int_{t-T}^{t} X_s\,ds$$

Equivalently, $Y$ is the output of the linear time-invariant system with input $X$ and impulse response
function $h$ given by

$$h(\tau) = \begin{cases} \frac{1}{T} & 0 \le \tau \le T \\ 0 & \text{else} \end{cases}$$

The output correlation function is given by Ry — h* h* Rx- Using (8.11) and referring to Figure 
8.3 we find that h * h is a triangular shaped waveform: 
Figure 8.3: Convolution of two rectangle functions.

Similarly, $C_Y = h * \widetilde{h} * C_X$. Let's find in particular an expression for the variance of $Y_t$ in terms of the function $C_X$.

$$\mathrm{Var}(Y_t) = C_Y(0) = \int_{-\infty}^{\infty} (h * \widetilde{h})(0 - \tau)\,C_X(\tau)\,d\tau = \frac{1}{T}\int_{-T}^{T}\left(1 - \frac{|\tau|}{T}\right) C_X(\tau)\,d\tau \qquad (8.16)$$

The expression in (8.16) arose earlier in these notes, in the section on mean ergodicity.

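Let's see the effect of the linear system on the power spectral density of the input. Observe that

$$H(\omega) = \int_{-\infty}^{\infty} e^{-j\omega t} h(t)\,dt = \frac{1}{T}\left[\frac{e^{-j\omega T} - 1}{-j\omega}\right] = \frac{2e^{-j\omega T/2}}{T\omega}\left[\frac{e^{j\omega T/2} - e^{-j\omega T/2}}{2j}\right] = e^{-j\omega T/2}\left[\frac{\sin\left(\frac{\omega T}{2}\right)}{\frac{\omega T}{2}}\right]$$

Equivalently, using the substitution $\omega = 2\pi f$,

$$H(2\pi f) = e^{-j\pi f T}\,\mathrm{sinc}(fT)$$

where in these notes the sinc function is defined by

$$\mathrm{sinc}(u) = \begin{cases} \frac{\sin(\pi u)}{\pi u} & u \ne 0 \\ 1 & u = 0. \end{cases} \qquad (8.17)$$

(Some authors use somewhat different definitions for the sinc function.) Therefore $|H(2\pi f)|^2 = |\mathrm{sinc}(fT)|^2$, so that the output power spectral density is given by $S_Y(2\pi f) = S_X(2\pi f)|\mathrm{sinc}(fT)|^2$. See Figure 8.4.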
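The closed form for the moving-average transfer function can be checked numerically. The following is a minimal sketch, assuming a hypothetical window length `T = 2.0` and a few arbitrary test frequencies; the function names are illustrative and not part of the notes. It compares a midpoint-rule approximation of $\frac{1}{T}\int_0^T e^{-j2\pi f t}\,dt$ with $e^{-j\pi fT}\,\mathrm{sinc}(fT)$.

```python
import cmath
import math


def sinc(u):
    """Normalized sinc as in (8.17): sin(pi u)/(pi u), with sinc(0) = 1."""
    return 1.0 if u == 0 else math.sin(math.pi * u) / (math.pi * u)


def H_closed_form(f, T):
    """H(2 pi f) = exp(-j pi f T) * sinc(f T) for the moving average filter."""
    return cmath.exp(-1j * math.pi * f * T) * sinc(f * T)


def H_numeric(f, T, steps=20000):
    """Midpoint-rule approximation of (1/T) * integral_0^T exp(-j 2 pi f t) dt."""
    dt = T / steps
    total = sum(cmath.exp(-2j * math.pi * f * (k + 0.5) * dt) for k in range(steps))
    return total * dt / T


if __name__ == "__main__":
    T = 2.0  # hypothetical averaging window duration
    for f in (0.0, 0.1, 0.25, 0.5, 1.3):
        gap = abs(H_numeric(f, T) - H_closed_form(f, T))
        print(f"f = {f}: |numeric - closed form| = {gap:.2e}")
```

The zeros of $|H(2\pi f)|^2$ at nonzero integer multiples of $1/T$ reflect the fact that a sinusoid whose period divides the window length averages to zero.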



Figure 8.4: The sinc function and $|H(2\pi f)|^2 = |\mathrm{sinc}(fT)|^2$.

Figure 8.5: Parallel linear systems ($X$ into $h$ gives $U$; $Y$ into $k$ gives $V$).

Example 8.2.2 Consider two linear time-invariant systems in parallel as shown in Figure 8.5. The first has input $X$, impulse response function $h$, and output $U$. The second has input $Y$, impulse response function $k$, and output $V$. Suppose that $X$ and $Y$ are jointly WSS. We can find $R_{UV}$ as follows. The main trick is notational: to use enough different variables of integration so that none are used twice.



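$$\begin{aligned}
R_{UV}(t,\tau) &= E\left[\int_{-\infty}^{\infty} h(t-s)X_s\,ds \left(\int_{-\infty}^{\infty} k(\tau - v)Y_v\,dv\right)^*\right] \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(t-s)R_{XY}(s-v)k^*(\tau - v)\,ds\,dv \\
&= \int_{-\infty}^{\infty} \{h * R_{XY}(t - v)\}\,k^*(\tau - v)\,dv \\
&= h * \widetilde{k} * R_{XY}(t - \tau).
\end{aligned}$$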

Note that $R_{UV}(t, \tau)$ is a function of $t - \tau$ alone. Together with the fact that $U$ and $V$ are individually WSS, this implies that $U$ and $V$ are jointly WSS, and $R_{UV} = h * \widetilde{k} * R_{XY}$. The relationship is expressed in the frequency domain as $S_{UV} = HK^*S_{XY}$, where $K$ is the Fourier transform of $k$. Special cases of this example include the case that $X = Y$ or $h = k$.



Example 8.2.3 Consider the circuit with a resistor and a capacitor shown in Figure 8.6. Take as the input signal the voltage difference on the left side, and as the output signal the voltage across the capacitor. Also, let $q_t$ denote the charge on the upper side of the capacitor. Let us first identify the impulse response function by assuming a deterministic input $x$ and a corresponding output $y$. The elementary equations for resistors and capacitors yield

$$\frac{dq_t}{dt} = \frac{1}{R}(x_t - y_t) \quad \text{and} \quad y_t = \frac{q_t}{C}$$

Figure 8.6: An RC circuit modeled as a linear system.

Therefore

$$\frac{dy_t}{dt} = \frac{1}{RC}(x_t - y_t)$$

which in the frequency domain is

$$j\omega\,\hat{y}(\omega) = \frac{1}{RC}\left(\hat{x}(\omega) - \hat{y}(\omega)\right)$$

so that $\hat{y} = H\hat{x}$ for the system transfer function $H$ given by

$$H(\omega) = \frac{1}{1 + RCj\omega}$$

Suppose, for example, that the input $X$ is a real-valued, stationary Gaussian Markov process, so that its autocorrelation function has the form $R_X(\tau) = A^2 e^{-\alpha|\tau|}$ for some constants $A^2$ and $\alpha > 0$. Then

$$S_X(\omega) = \frac{2A^2\alpha}{\omega^2 + \alpha^2}$$

and

$$S_Y(\omega) = S_X(\omega)|H(\omega)|^2 = \frac{2A^2\alpha}{(\omega^2 + \alpha^2)(1 + (RC\omega)^2)}$$
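As a sanity check on the output spectral density of the RC example, the total output power $\int S_Y(\omega)\frac{d\omega}{2\pi}$ can be computed numerically. A partial fraction expansion (a side calculation, not taken from the notes) gives the closed form $A^2/(1 + \alpha RC)$. The sketch below compares the two, assuming hypothetical parameter values $A^2 = 3$, $\alpha = 2$, $RC = 0.25$ and a truncated integration range.

```python
import math


def S_X(w, A2, alpha):
    """Input PSD 2 A^2 alpha / (w^2 + alpha^2)."""
    return 2.0 * A2 * alpha / (w * w + alpha * alpha)


def H_sq(w, RC):
    """|H(w)|^2 for H(w) = 1 / (1 + RC j w)."""
    return 1.0 / (1.0 + (RC * w) ** 2)


def S_Y(w, A2, alpha, RC):
    """Output PSD S_X(w) |H(w)|^2."""
    return S_X(w, A2, alpha) * H_sq(w, RC)


def output_power(A2, alpha, RC, w_max=2000.0, steps=400000):
    """Midpoint rule for the integral of S_Y(w) dw / (2 pi) over [-w_max, w_max].

    Truncating at w_max is an approximation; the neglected tail decays like w^-3.
    """
    dw = 2.0 * w_max / steps
    total = sum(S_Y(-w_max + (k + 0.5) * dw, A2, alpha, RC) for k in range(steps))
    return total * dw / (2.0 * math.pi)


if __name__ == "__main__":
    A2, alpha, RC = 3.0, 2.0, 0.25  # hypothetical values
    print("numeric power    :", output_power(A2, alpha, RC))
    print("A^2/(1+alpha*RC) :", A2 / (1.0 + alpha * RC))
```

The closed form shows the expected behavior: as $RC \to 0$ the filter passes everything and the output power tends to the input power $R_X(0) = A^2$.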



Example 8.2.4 A random signal, modeled by the input random process $X$, is passed into a linear time-invariant system with feedback and with noise modeled by the random process $N$, as shown in Figure 8.7. The output is denoted by $Y$. Assume that $X$ and $N$ are jointly WSS and that the random variables comprising $X$ are orthogonal to the random variables comprising $N$: $R_{XN} = 0$. Assume also, for the sake of system stability, that the magnitude of the gain around the loop satisfies $|H_3(\omega)H_1(\omega)H_2(\omega)| < 1$ for all $\omega$ such that $S_X(\omega) > 0$ or $S_N(\omega) > 0$. We shall express the output power spectral density $S_Y$ in terms of the power spectral densities of $X$ and $N$, and the three transfer functions $H_1$, $H_2$, and $H_3$. An expression for the signal-to-noise power ratio at the output will also be computed.

Figure 8.7: A feedback system.

Under the assumed stability condition, the linear system can be written in the equivalent form shown in Figure 8.8. The process $\widetilde{X}$ is the output due to the input signal $X$, and $\widetilde{N}$ is the output due to the input noise $N$. The structure in Figure 8.8 is the same as considered in Example 8.2.2.

Figure 8.8: An equivalent representation. ($X$ is passed through $\frac{H_2(\omega)H_1(\omega)}{1 - H_3(\omega)H_1(\omega)H_2(\omega)}$ to give $\widetilde{X}_t$, $N$ is passed through $\frac{H_2(\omega)}{1 - H_3(\omega)H_1(\omega)H_2(\omega)}$ to give $\widetilde{N}_t$, and $Y_t = \widetilde{X}_t + \widetilde{N}_t$.)

Since $R_{XN} = 0$ it follows that $R_{\widetilde{X}\widetilde{N}} = 0$, so that $S_Y = S_{\widetilde{X}} + S_{\widetilde{N}}$. Consequently,

$$S_Y(\omega) = S_{\widetilde{X}}(\omega) + S_{\widetilde{N}}(\omega) = \frac{|H_2(\omega)|^2\left[|H_1(\omega)|^2 S_X(\omega) + S_N(\omega)\right]}{|1 - H_3(\omega)H_1(\omega)H_2(\omega)|^2}$$






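The output signal-to-noise ratio is the ratio of the power of the signal at the output to the power of the noise at the output. For this example it is given by

$$\frac{E[|\widetilde{X}_t|^2]}{E[|\widetilde{N}_t|^2]} = \frac{\int_{-\infty}^{\infty} \frac{|H_2(\omega)H_1(\omega)|^2 S_X(\omega)}{|1 - H_3(\omega)H_1(\omega)H_2(\omega)|^2}\frac{d\omega}{2\pi}}{\int_{-\infty}^{\infty} \frac{|H_2(\omega)|^2 S_N(\omega)}{|1 - H_3(\omega)H_1(\omega)H_2(\omega)|^2}\frac{d\omega}{2\pi}}$$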



Example 8.2.5 Consider the linear time-invariant system defined as follows. For input signal $x$ the output signal $y$ is defined by $y''' + y' + y = x + x'$. We seek to find the power spectral density of the output process if the input is a white noise process $X$ with $R_X(\tau) = \sigma^2\delta(\tau)$ and $S_X(\omega) = \sigma^2$ for all $\omega$. To begin, we identify the transfer function of the system. In the frequency domain, the system is described by $((j\omega)^3 + j\omega + 1)\hat{y}(\omega) = (1 + j\omega)\hat{x}(\omega)$, so that

$$H(\omega) = \frac{1 + j\omega}{1 + j\omega + (j\omega)^3} = \frac{1 + j\omega}{1 + j(\omega - \omega^3)}$$

Hence,

$$S_Y(\omega) = S_X(\omega)|H(\omega)|^2 = \frac{\sigma^2(1 + \omega^2)}{1 + (\omega - \omega^3)^2} = \frac{\sigma^2(1 + \omega^2)}{1 + \omega^2 - 2\omega^4 + \omega^6}.$$

Observe that

$$\text{output power} = \int_{-\infty}^{\infty} S_Y(\omega)\frac{d\omega}{2\pi} < \infty.$$
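The algebra in this example is easy to check numerically. The sketch below, with hypothetical function names and an arbitrary choice $\sigma^2 = 1$, evaluates $|H(\omega)|^2$ directly from the transfer function, compares it with the expanded rational expression for $S_Y$, and approximates the output power over a truncated frequency range. Since the integrand decays like $\omega^{-4}$, truncation changes the result only slightly.

```python
import math


def H(w):
    """Transfer function (1 + jw) / (1 + jw + (jw)^3)."""
    jw = 1j * w
    return (1 + jw) / (1 + jw + jw ** 3)


def S_Y(w, sigma2=1.0):
    """Output PSD in expanded form: sigma^2 (1 + w^2) / (1 + w^2 - 2 w^4 + w^6)."""
    return sigma2 * (1 + w ** 2) / (1 + w ** 2 - 2 * w ** 4 + w ** 6)


def output_power(sigma2=1.0, w_max=200.0, steps=400000):
    """Midpoint rule for the integral of S_Y(w) dw / (2 pi) over [-w_max, w_max]."""
    dw = 2.0 * w_max / steps
    total = sum(S_Y(-w_max + (k + 0.5) * dw, sigma2) for k in range(steps))
    return total * dw / (2.0 * math.pi)


if __name__ == "__main__":
    for w in (0.0, 0.7, 1.0, 3.0):
        print(w, abs(H(w)) ** 2, S_Y(w))
    print("approximate output power:", output_power())
```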



8.3 Discrete-time processes in linear systems 

The basic definitions and use of Fourier transforms described above carry over naturally to discrete time. In particular, if the random process $X = (X_k : k \in \mathbb{Z})$ is the input of a linear, discrete-time system with impulse response function $h$, then the output $Y$ is the random process given by

$$Y_k = \sum_{n=-\infty}^{\infty} h(k,n)X_n.$$



The equations in Section 8.1 can be modified to hold for discrete time simply by replacing integration over $\mathbb{R}$ by summation over $\mathbb{Z}$. In particular, if $X$ is WSS and if the linear system is time-invariant then (8.10) becomes

$$\mu_Y = \mu_X \sum_{n=-\infty}^{\infty} h(n) \qquad R_{YX} = h * R_X \qquad R_Y = h * \widetilde{h} * R_X, \qquad (8.18)$$

where the convolution in (8.18) is defined for functions $g$ and $h$ on $\mathbb{Z}$ by

$$g * h(n) = \sum_{k=-\infty}^{\infty} g(n-k)h(k)$$

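Again, Fourier transforms can be used to convert convolution to multiplication. The Fourier transform of a function $g = (g(n) : n \in \mathbb{Z})$ is the function $\hat{g}$ on $[-\pi, \pi]$ defined by

$$\hat{g}(\omega) = \sum_{n=-\infty}^{\infty} e^{-j\omega n} g(n).$$

Some of the most basic properties are:

Linearity: $\widehat{ag + bh} = a\hat{g} + b\hat{h}$

Inversion: $g(n) = \int_{-\pi}^{\pi} e^{j\omega n}\hat{g}(\omega)\frac{d\omega}{2\pi}$

Convolution to multiplication: $\widehat{g * h} = \hat{g}\hat{h}$ and $\hat{g} * \hat{h} = 2\pi\,\widehat{gh}$

Parseval's identity: $\sum_{n=-\infty}^{\infty} g(n)h^*(n) = \int_{-\pi}^{\pi} \hat{g}(\omega)\hat{h}^*(\omega)\frac{d\omega}{2\pi}$

Transform of time reversal: $\hat{\widetilde{h}} = \hat{h}^*$, where $\widetilde{h}(t) = h(-t)^*$

Pure sinusoid to delta function: For $\omega_o \in [-\pi, \pi]$ fixed: $\widehat{e^{j\omega_o n}}(\omega) = 2\pi\delta(\omega - \omega_o)$

Delta function to pure sinusoid: For $n_o$ fixed: $\widehat{I_{\{n = n_o\}}}(\omega) = e^{-j\omega n_o}$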

The inversion formula above shows that a function $g$ on $\mathbb{Z}$ can be represented as an integral (basically a limiting form of linear combination) of sinusoidal functions of time $e^{j\omega n}$, and $\hat{g}(\omega)$ is the coefficient in the representation for each $\omega$. Parseval's identity applied with $g = h$ yields that the total energy of $g$ (the square of the $L^2$ norm) can be computed in either the time or frequency domain: $\|g\|^2 = \sum_{n=-\infty}^{\infty} |g(n)|^2 = \int_{-\pi}^{\pi} |\hat{g}(\omega)|^2\frac{d\omega}{2\pi}$.
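Parseval's identity in discrete time is easy to test on a finitely supported sequence. The following is a minimal sketch, assuming a hypothetical sequence `g` stored as a dictionary from integer times to complex values; the helper names are illustrative. The frequency-domain energy is approximated with a midpoint rule over $[-\pi, \pi]$, which is accurate to rounding error here because $|\hat{g}(\omega)|^2$ is a trigonometric polynomial of low degree.

```python
import cmath
import math


def dtft(g, w):
    """Fourier transform of a finitely supported sequence g (dict: n -> g(n)) at frequency w."""
    return sum(val * cmath.exp(-1j * w * n) for n, val in g.items())


def energy_time(g):
    """Sum over n of |g(n)|^2."""
    return sum(abs(val) ** 2 for val in g.values())


def energy_freq(g, steps=2000):
    """Midpoint rule for the integral of |g_hat(w)|^2 dw / (2 pi) over [-pi, pi]."""
    dw = 2.0 * math.pi / steps
    total = sum(abs(dtft(g, -math.pi + (k + 0.5) * dw)) ** 2 for k in range(steps))
    return total * dw / (2.0 * math.pi)


if __name__ == "__main__":
    g = {-1: 0.5, 0: 1.0, 2: -2.0 + 1.0j}  # hypothetical test sequence
    print("time-domain energy     :", energy_time(g))
    print("frequency-domain energy:", energy_freq(g))
```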

The Fourier transform and its inversion formula for discrete-time functions are equivalent to the Fourier series representation of functions in $L^2[-\pi, \pi]$ using the complete orthogonal basis $(e^{j\omega n} : n \in \mathbb{Z})$ for $L^2[-\pi, \pi]$, as discussed in connection with the Karhunen-Loeve expansion. The functions in this basis all have squared norm $2\pi$. Recall that when we considered the Karhunen-Loeve expansion for a periodic WSS random process of period $T$, functions on a time interval were important and the power was distributed on the integers $\mathbb{Z}$ scaled by $\frac{1}{T}$. In this section, $\mathbb{Z}$ is considered to be the time domain and the power is distributed over an interval. That is, the roles of $\mathbb{Z}$ and a finite interval are interchanged. The transforms used are essentially the same, but with $j$ replaced by $-j$.

Given a linear time-invariant system in discrete time with an impulse response function $h = (h(\tau) : \tau \in \mathbb{Z})$, the Fourier transform of $h$ is denoted by $H(\omega)$. The defining relation for the system in the time domain, $y = h * x$, becomes $\hat{y}(\omega) = H(\omega)\hat{x}(\omega)$ in the frequency domain. For $-\pi \le a < b \le \pi$,

$$\text{Energy of } x \text{ in frequency interval } [a, b] = \int_a^b |\hat{x}(\omega)|^2\frac{d\omega}{2\pi},$$

so it is appropriate to call $|\hat{x}(\omega)|^2$ the energy spectral density of the deterministic, discrete-time signal $x$.

Given a WSS random process $X = (X_n : n \in \mathbb{Z})$, the Fourier transform of its correlation function $R_X$ is denoted by $S_X$, and is called the power spectral density of $X$. Similarly, if $Y$ and $X$ are jointly WSS, then the Fourier transform of $R_{YX}$ is denoted by $S_{YX}$, called the cross power spectral density function of $Y$ and $X$. With the above notation, the second moment relationships in (8.18) become:

$$S_{YX}(\omega) = H(\omega)S_X(\omega) \qquad S_Y(\omega) = |H(\omega)|^2 S_X(\omega)$$

The Fourier inversion formula applied to $S_X$ yields that $R_X(n) = \int_{-\pi}^{\pi} e^{j\omega n} S_X(\omega)\frac{d\omega}{2\pi}$. In particular,

$$E[|X_n|^2] = R_X(0) = \int_{-\pi}^{\pi} S_X(\omega)\frac{d\omega}{2\pi}.$$

The expectation $E[|X_n|^2]$ is called the power (or total power) of $X$, and for $-\pi \le a < b \le \pi$ we have

$$\text{Power of } X \text{ in frequency interval } [a, b] = \int_a^b S_X(\omega)\frac{d\omega}{2\pi}$$
8.4 Baseband random processes

Deterministic baseband signals are considered first. Let $x$ be a continuous-time signal (i.e. a function on $\mathbb{R}$) such that its energy, $\int_{-\infty}^{\infty} |x(t)|^2\,dt$, is finite. By the Fourier inversion formula, the signal $x$ is an integral, which is essentially a sum, of sinusoidal functions of time, $e^{j\omega t}$. The weights are given by the Fourier transform $\hat{x}(\omega)$. Let $f_o > 0$ and let $\omega_o = 2\pi f_o$. The signal $x$ is called a baseband signal, with one-sided band limit $f_o$ Hz, or equivalently $\omega_o$ radians/second, if $\hat{x}(\omega) = 0$ for $|\omega| \ge \omega_o$. For such a signal, the Fourier inversion formula becomes

$$x(t) = \int_{-\omega_o}^{\omega_o} e^{j\omega t}\hat{x}(\omega)\frac{d\omega}{2\pi} \qquad (8.19)$$

Equation (8.19) displays the baseband signal $x$ as a linear combination of the functions $e^{j\omega t}$ indexed by $\omega \in [-\omega_o, \omega_o]$.

A celebrated theorem of Nyquist states that the baseband signal $x$ is completely determined by its samples taken at sampling frequency $2f_o$. Specifically, define $T$ by $\frac{1}{T} = 2f_o$. Then

$$x(t) = \sum_{n=-\infty}^{\infty} x(nT)\,\mathrm{sinc}\left(\frac{t - nT}{T}\right), \qquad (8.20)$$

where the sinc function is defined by (8.17). Nyquist's equation (8.20) is indeed elegant. It obviously holds by inspection if $t = mT$ for some integer $m$, because for $t = mT$ the only nonzero term in the sum is the one indexed by $n = m$. The equation shows that the sinc function gives the correct interpolation of the baseband signal $x$ for times in between the integer multiples of $T$. We shall give a proof of (8.20) for deterministic signals, before considering its extension to random processes.

A proof of (8.20) goes as follows. Henceforth we will use $\omega_o$ more often than $f_o$, so it is worth remembering that $\omega_o T = \pi$. Taking $t = nT$ in (8.19) yields

$$x(nT) = \int_{-\omega_o}^{\omega_o} e^{j\omega nT}\hat{x}(\omega)\frac{d\omega}{2\pi} = \int_{-\omega_o}^{\omega_o} \hat{x}(\omega)\left(e^{-j\omega nT}\right)^*\frac{d\omega}{2\pi} \qquad (8.21)$$

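Equation (8.21) shows that $x(nT)$ is given by an inner product of $\hat{x}$ and $e^{-j\omega nT}$. The functions $e^{-j\omega nT}$, considered on the interval $-\omega_o < \omega < \omega_o$ and indexed by $n \in \mathbb{Z}$, form a complete orthogonal basis for $L^2[-\omega_o, \omega_o]$, and $\int_{-\omega_o}^{\omega_o} T|e^{-j\omega nT}|^2\frac{d\omega}{2\pi} = 1$. Therefore, $\hat{x}$ over the interval $[-\omega_o, \omega_o]$ has the following Fourier series representation:

$$\hat{x}(\omega) = T\sum_{n=-\infty}^{\infty} e^{-j\omega nT}x(nT) \qquad \omega \in [-\omega_o, \omega_o] \qquad (8.22)$$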

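Plugging (8.22) into (8.19) yields

$$x(t) = \sum_{n=-\infty}^{\infty} x(nT)\,T\int_{-\omega_o}^{\omega_o} e^{j\omega t}e^{-j\omega nT}\frac{d\omega}{2\pi}. \qquad (8.23)$$

The integral in (8.23) can be simplified using

$$\int_{-\omega_o}^{\omega_o} e^{j\omega\tau}\,T\frac{d\omega}{2\pi} = \mathrm{sinc}\left(\frac{\tau}{T}\right) \qquad (8.24)$$

with $\tau = t - nT$ to yield (8.20) as desired.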
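The interpolation formula (8.20) can be tried out on a concrete signal. The sketch below uses a hypothetical test signal $x(t) = \mathrm{sinc}(f_o t)^2$, chosen because its Fourier transform is a triangle supported on $|f| \le f_o$, so it is a baseband signal with one-sided band limit $f_o$ in the sense used above. The infinite sum is truncated to $|n| \le N$, which is an approximation; for this signal the neglected terms decay like $n^{-3}$.

```python
import math


def sinc(u):
    """Normalized sinc as in (8.17)."""
    return 1.0 if u == 0 else math.sin(math.pi * u) / (math.pi * u)


def x(t, f_o):
    """Hypothetical baseband test signal sinc(f_o t)^2 with one-sided band limit f_o."""
    return sinc(f_o * t) ** 2


def reconstruct(t, f_o, N=500):
    """Truncated version of (8.20): sum over |n| <= N of x(nT) sinc((t - nT)/T)."""
    T = 1.0 / (2.0 * f_o)
    return sum(x(n * T, f_o) * sinc((t - n * T) / T) for n in range(-N, N + 1))


if __name__ == "__main__":
    f_o = 1.0  # hypothetical band limit, so T = 0.5
    for t in (0.0, 0.3, 0.77, 1.0, 2.41):
        print(t, x(t, f_o), reconstruct(t, f_o))
```

At a sampling instant $t = mT$ only the term $n = m$ contributes, as noted in the text; between sampling instants every sample contributes through the sinc weights.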

The sampling theorem extends naturally to WSS random processes. A WSS random process $X$ with spectral density $S_X$ is said to be a baseband random process with one-sided band limit $\omega_o$ if $S_X(\omega) = 0$ for $|\omega| \ge \omega_o$.

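Proposition 8.4.1 Suppose $X$ is a WSS baseband random process with one-sided band limit $\omega_o$ and let $T = \pi/\omega_o$. Then for each $t \in \mathbb{R}$

$$X_t = \sum_{n=-\infty}^{\infty} X_{nT}\,\mathrm{sinc}\left(\frac{t - nT}{T}\right) \quad \text{m.s.} \qquad (8.25)$$

If $B$ is the process of samples defined by $B_n = X_{nT}$, then the power spectral densities of $B$ and $X$ are related by

$$S_B(\omega) = \frac{1}{T}S_X\left(\frac{\omega}{T}\right) \quad \text{for } |\omega| \le \pi \qquad (8.26)$$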



Proof. Fix $t \in \mathbb{R}$. It must be shown that $\varepsilon_N$ defined by the following expectation converges to zero as $N \to \infty$:

$$\varepsilon_N = E\left[\left|X_t - \sum_{n=-N}^{N} X_{nT}\,\mathrm{sinc}\left(\frac{t - nT}{T}\right)\right|^2\right]$$

When the square is expanded, terms of the form $E[X_aX_b^*]$ arise, where $a$ and $b$ take on the values $t$ or $nT$ for some $n$. But

$$E[X_aX_b^*] = R_X(a - b) = \int_{-\infty}^{\infty} e^{j\omega a}\left(e^{j\omega b}\right)^* S_X(\omega)\frac{d\omega}{2\pi}.$$

Therefore, $\varepsilon_N$ can be expressed as an integration over $\omega$ rather than as an expectation:

$$\varepsilon_N = \int_{-\infty}^{\infty}\left|e^{j\omega t} - \sum_{n=-N}^{N} e^{j\omega nT}\,\mathrm{sinc}\left(\frac{t - nT}{T}\right)\right|^2 S_X(\omega)\frac{d\omega}{2\pi}. \qquad (8.27)$$

For $t$ fixed, the function $(e^{j\omega t} : -\omega_o < \omega < \omega_o)$ has a Fourier series representation (use (8.24))

$$e^{j\omega t} = T\sum_{n=-\infty}^{\infty} e^{j\omega nT}\int_{-\omega_o}^{\omega_o} e^{j\nu t}e^{-j\nu nT}\frac{d\nu}{2\pi} = \sum_{n=-\infty}^{\infty} e^{j\omega nT}\,\mathrm{sinc}\left(\frac{t - nT}{T}\right),$$

so that the quantity inside the absolute value signs in (8.27) is the approximation error for the $N$th partial Fourier series sum for $e^{j\omega t}$. Since $e^{j\omega t}$ is continuous in $\omega$, a basic result in the theory of Fourier series yields that the Fourier approximation error is bounded by a single constant for all $N$ and $\omega$, and as $N \to \infty$ the Fourier approximation error converges to 0 uniformly on sets of the form $|\omega| \le \omega_o - \epsilon$. Thus $\varepsilon_N \to 0$ as $N \to \infty$ by the dominated convergence theorem. The representation (8.25) is proved.

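Clearly $B$ is a WSS discrete time random process with $\mu_B = \mu_X$ and

$$R_B(n) = R_X(nT) = \int_{-\infty}^{\infty} e^{jnT\omega}S_X(\omega)\frac{d\omega}{2\pi} = \int_{-\omega_o}^{\omega_o} e^{jnT\omega}S_X(\omega)\frac{d\omega}{2\pi},$$

so, using a change of variable $\nu = T\omega$ and the fact $T = \frac{\pi}{\omega_o}$ yields

$$R_B(n) = \int_{-\pi}^{\pi} e^{jn\nu}\,\frac{1}{T}S_X\left(\frac{\nu}{T}\right)\frac{d\nu}{2\pi}. \qquad (8.28)$$

But $S_B(\omega)$ is the unique function on $[-\pi, \pi]$ such that

$$R_B(n) = \int_{-\pi}^{\pi} e^{jn\omega}S_B(\omega)\frac{d\omega}{2\pi}$$

so (8.26) holds. The proof of Proposition 8.4.1 is complete. □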

As a check on (8.26), we note that $B(0) = X(0)$, so the processes have the same total power. Thus, it must be that

$$\int_{-\pi}^{\pi} S_B(\omega)\frac{d\omega}{2\pi} = \int_{-\infty}^{\infty} S_X(\omega)\frac{d\omega}{2\pi}, \qquad (8.29)$$

which is indeed consistent with (8.26).
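The consistency check (8.29) can also be carried out numerically for a particular spectral density. The sketch below assumes a hypothetical triangular baseband density $S_X(\omega) = (1 - |\omega|/\omega_o)_+$ and an arbitrary band limit; it builds $S_B$ from (8.26) and compares the two total powers.

```python
import math


def S_X(w, w_o):
    """Hypothetical baseband PSD: triangle of height 1 supported on |w| < w_o."""
    return max(0.0, 1.0 - abs(w) / w_o)


def S_B(w, w_o):
    """PSD of the samples B_n = X_{nT} from (8.26), with T = pi / w_o."""
    T = math.pi / w_o
    return S_X(w / T, w_o) / T if abs(w) <= math.pi else 0.0


def power(psd, a, b, steps=100000):
    """Midpoint rule for the integral of psd(w) dw / (2 pi) over [a, b]."""
    dw = (b - a) / steps
    total = sum(psd(a + (k + 0.5) * dw) for k in range(steps))
    return total * dw / (2.0 * math.pi)


if __name__ == "__main__":
    w_o = 3.0  # hypothetical one-sided band limit in radians/second
    p_x = power(lambda w: S_X(w, w_o), -w_o, w_o)
    p_b = power(lambda w: S_B(w, w_o), -math.pi, math.pi)
    print("power of X:", p_x)
    print("power of B:", p_b)
```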



Example 8.4.2 If $\mu_X = 0$ and the spectral density $S_X$ of $X$ is constant over the interval $[-\omega_o, \omega_o]$, then $\mu_B = 0$ and $S_B(\omega)$ is constant over the interval $[-\pi, \pi]$. Therefore $R_B(n) = C_B(n) = 0$ for $n \ne 0$, and the samples $(B(n))$ are mean zero, uncorrelated random variables.



Theoretical Exercise What does (8.26) become if X is WSS and has a power spectral density, 
but X is not a baseband signal? 
8.5 Narrowband random processes

As noted in the previous section, a signal - modeled as either a deterministic finite energy signal 
or a WSS random process - can be reconstructed from samples taken at a sampling rate twice the 
highest frequency of the signal. For example, a typical voice signal may have highest frequency 5
kHz. If such a signal is multiplied by a signal with frequency $10^9$ Hz, the highest frequency of the
resulting product is about 200,000 times larger than that of the original signal. Naive application
of the sampling theorem would mean that the sampling rate would have to increase by the same 
factor. Fortunately, because the energy or power of such a modulated signal is concentrated in a 
narrow band, the signal is nearly as simple as the original baseband signal. The motivation of this 
section is to see how signals and random processes with narrow spectral ranges can be analyzed in 
terms of equivalent baseband signals. For example, the effects of filtering can be analyzed using 
baseband equivalent filters. As an application, an example at the end of the section is given which 
describes how a narrowband random process (to be defined) can be simulated using a sampling rate 
equal to twice the one-sided width of a frequency band of a signal, rather than twice the highest 
frequency of the signal. 

Deterministic narrowband signals are considered first, and the development for random processes follows a similar approach. Let $\omega_c > \omega_o > 0$. A narrowband signal (relative to $\omega_o$ and $\omega_c$) is a signal $x$ such that $\hat{x}(\omega) = 0$ unless $\omega$ is in the union of two intervals: the upper band, $(\omega_c - \omega_o, \omega_c + \omega_o)$, and the lower band, $(-\omega_c - \omega_o, -\omega_c + \omega_o)$. More compactly, $\hat{x}(\omega) = 0$ if $||\omega| - \omega_c| \ge \omega_o$.

A narrowband signal arises when a sinusoidal signal is modulated by a baseband signal, as shown next. Let $u$ and $v$ be real-valued baseband signals, each with one-sided bandwidth less than $\omega_o$, as defined at the beginning of the previous section. Define a signal $x$ by

$$x(t) = u(t)\cos(\omega_c t) - v(t)\sin(\omega_c t). \qquad (8.30)$$

Since $\cos(\omega_c t) = (e^{j\omega_c t} + e^{-j\omega_c t})/2$ and $-\sin(\omega_c t) = (je^{j\omega_c t} - je^{-j\omega_c t})/2$, (8.30) becomes

$$\hat{x}(\omega) = \frac{1}{2}\left\{\hat{u}(\omega - \omega_c) + \hat{u}(\omega + \omega_c) + j\hat{v}(\omega - \omega_c) - j\hat{v}(\omega + \omega_c)\right\} \qquad (8.31)$$

Graphically, $\hat{x}$ is obtained by sliding $\frac{1}{2}\hat{u}$ to the right by $\omega_c$, $\frac{1}{2}\hat{u}$ to the left by $\omega_c$, $\frac{j}{2}\hat{v}$ to the right by $\omega_c$, and $\frac{-j}{2}\hat{v}$ to the left by $\omega_c$, and then adding. Of course $x$ is real-valued by its definition. The reader is encouraged to verify from (8.31) that $\hat{x}(\omega) = \hat{x}^*(-\omega)$. Equation (8.31) shows that indeed $x$ is a narrowband signal.

A convenient alternative expression for $x$ is obtained by defining a complex valued baseband signal $z$ by $z(t) = u(t) + jv(t)$. Then $x(t) = \mathrm{Re}(z(t)e^{j\omega_c t})$. It is a good idea to keep in mind the case that $\omega_c$ is much larger than $\omega_o$ (written $\omega_c \gg \omega_o$). Then $z$ varies slowly compared to the complex sinusoid $e^{j\omega_c t}$. In a small neighborhood of a fixed time $t$, $x$ is approximately a sinusoid with frequency $\omega_c$, peak amplitude $|z(t)|$, and phase given by the argument of $z(t)$. The signal $z$ is called the complex envelope of $x$ and $|z(t)|$ is called the real envelope of $x$.
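The equivalence between the quadrature form (8.30) and the complex envelope form $x(t) = \mathrm{Re}(z(t)e^{j\omega_c t})$ is an algebraic identity, so it can be checked with any pair of real functions. The sketch below uses hypothetical slowly varying functions `u` and `v` (Gaussian-shaped, hence only approximately bandlimited) and an arbitrary carrier frequency; it also illustrates that $|x(t)|$ never exceeds the real envelope $|z(t)|$.

```python
import cmath
import math

W_C = 50.0  # hypothetical carrier frequency in radians/second


def u(t):
    """Hypothetical in-phase component (slowly varying, only approximately bandlimited)."""
    return math.exp(-t * t)


def v(t):
    """Hypothetical quadrature component."""
    return t * math.exp(-t * t)


def x_quadrature(t):
    """Form (8.30): u(t) cos(w_c t) - v(t) sin(w_c t)."""
    return u(t) * math.cos(W_C * t) - v(t) * math.sin(W_C * t)


def x_envelope(t):
    """Complex envelope form: Re(z(t) exp(j w_c t)) with z = u + j v."""
    z = complex(u(t), v(t))
    return (z * cmath.exp(1j * W_C * t)).real


if __name__ == "__main__":
    for t in (-1.0, -0.2, 0.0, 0.35, 1.7):
        z_abs = abs(complex(u(t), v(t)))
        print(t, x_quadrature(t), x_envelope(t), z_abs)
```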

So far we have shown that a real-valued narrowband signal $x$ results from modulating sinusoidal functions by a pair of real-valued baseband signals, or equivalently, modulating a complex sinusoidal function by a complex-valued baseband signal. Does every real-valued narrowband signal have such a representation? The answer is yes, as we now show. Let $x$ be a real-valued narrowband signal with finite energy. One attempt to obtain a baseband signal from $x$ is to consider $e^{-j\omega_c t}x(t)$. This has Fourier transform $\hat{x}(\omega + \omega_c)$, and the graph of this transform is obtained by sliding the graph of $\hat{x}(\omega)$ to the left by $\omega_c$. As desired, that shifts the portion of $\hat{x}$ in the upper band to the baseband interval $(-\omega_o, \omega_o)$. However, the portion of $\hat{x}$ in the lower band gets shifted to an interval centered about $-2\omega_c$, so that $e^{-j\omega_c t}x(t)$ is not a baseband signal.

An elegant solution to this problem is to use the Hilbert transform of $x$, denoted by $\check{x}$. By definition, $\check{x}$ is the signal with Fourier transform $-j\,\mathrm{sgn}(\omega)\hat{x}(\omega)$, where

$$\mathrm{sgn}(\omega) = \begin{cases} 1 & \omega > 0 \\ 0 & \omega = 0 \\ -1 & \omega < 0 \end{cases}$$

Therefore $\check{x}$ can be viewed as the result of passing $x$ through a linear, time-invariant system with transfer function $-j\,\mathrm{sgn}(\omega)$ as pictured in Figure 8.9. Since this transfer function satisfies $H^*(\omega) = H(-\omega)$, the output signal $\check{x}$ is again real-valued. In addition, $|H(\omega)| = 1$ for all $\omega$, except $\omega = 0$, so that the Fourier transforms of $x$ and $\check{x}$ have the same magnitude for all nonzero $\omega$. In particular, $x$ and $\check{x}$ have equal energies.

Figure 8.9: The Hilbert transform as a linear, time-invariant system.

Consider the Fourier transform of $x + j\check{x}$. It is equal to $2\hat{x}(\omega)$ in the upper band and it is zero elsewhere. Thus, $z$ defined by $z(t) = (x(t) + j\check{x}(t))e^{-j\omega_c t}$ is a baseband complex valued signal. Note that $x(t) = \mathrm{Re}(x(t)) = \mathrm{Re}(x(t) + j\check{x}(t))$, or equivalently

$$x(t) = \mathrm{Re}\left(z(t)e^{j\omega_c t}\right) \qquad (8.32)$$

If we let $u(t) = \mathrm{Re}(z(t))$ and $v(t) = \mathrm{Im}(z(t))$, then $u$ and $v$ are real-valued baseband signals such that $z(t) = u(t) + jv(t)$, and (8.32) becomes (8.30).

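In summary, any finite energy real-valued narrowband signal $x$ can be represented as (8.30) or (8.32), where $z(t) = u(t) + jv(t)$. The Fourier transform $\hat{z}$ can be expressed in terms of $\hat{x}$ by

$$\hat{z}(\omega) = \begin{cases} 2\hat{x}(\omega + \omega_c) & |\omega| \le \omega_o \\ 0 & \text{else,} \end{cases} \qquad (8.33)$$

and $\hat{u}$ is the Hermitian symmetric part of $\hat{z}$ and $\hat{v}$ is $-j$ times the Hermitian antisymmetric part of $\hat{z}$:

$$\hat{u}(\omega) = \frac{1}{2}\left(\hat{z}(\omega) + \hat{z}^*(-\omega)\right) \qquad \hat{v}(\omega) = \frac{-j}{2}\left(\hat{z}(\omega) - \hat{z}^*(-\omega)\right)$$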
In the other direction, $\hat{x}$ can be expressed in terms of $\hat{u}$ and $\hat{v}$ by (8.31).

If $x_1$ and $x_2$ are each narrowband signals with corresponding complex envelopes $z_1$ and $z_2$, then the convolution $x = x_1 * x_2$ is again a narrowband signal, and the corresponding complex envelope is $\frac{1}{2}z_1 * z_2$. To see this, note that the Fourier transform, $\hat{z}$, of the complex envelope $z$ for $x$ is given by (8.33). Similar equations hold for $\hat{z}_i$ in terms of $\hat{x}_i$ for $i = 1, 2$. Using these equations and the fact $\hat{x}(\omega) = \hat{x}_1(\omega)\hat{x}_2(\omega)$, it is readily seen that $\hat{z}(\omega) = \frac{1}{2}\hat{z}_1(\omega)\hat{z}_2(\omega)$ for all $\omega$, establishing the claim. Thus, the analysis of linear, time invariant filtering of narrowband signals can be carried out in the baseband equivalent setting.

A similar development is considered next for WSS random processes. Let $U$ and $V$ be jointly WSS real-valued baseband random processes, and let $X$ be defined by

$$X_t = U_t\cos(\omega_c t) - V_t\sin(\omega_c t) \qquad (8.34)$$

or equivalently, defining $Z_t$ by $Z_t = U_t + jV_t$,

$$X_t = \mathrm{Re}\left(Z_te^{j\omega_c t}\right) \qquad (8.35)$$

In some sort of generalized sense, we expect that $X$ is a narrowband process. However, such an $X$ need not even be WSS. Let us find the conditions on $U$ and $V$ that make $X$ WSS. First, in order that $\mu_X(t)$ not depend on $t$, it must be that $\mu_U = \mu_V = 0$.

Using the notation $c_t = \cos(\omega_c t)$, $s_t = \sin(\omega_c t)$, and $\tau = a - b$,

$$R_X(a, b) = R_U(\tau)c_ac_b - R_{UV}(\tau)c_as_b - R_{VU}(\tau)s_ac_b + R_V(\tau)s_as_b.$$

Using the trigonometric identities such as $c_ac_b = (c_{a-b} + c_{a+b})/2$, this can be rewritten as

$$R_X(a,b) = \left(\frac{R_U(\tau) + R_V(\tau)}{2}\right)c_{a-b} + \left(\frac{R_{UV}(\tau) - R_{VU}(\tau)}{2}\right)s_{a-b} + \left(\frac{R_U(\tau) - R_V(\tau)}{2}\right)c_{a+b} - \left(\frac{R_{UV}(\tau) + R_{VU}(\tau)}{2}\right)s_{a+b}.$$

Therefore, in order that $R_X(a, b)$ is a function of $a - b$, it must be that $R_U = R_V$ and $R_{UV} = -R_{VU}$. Since in general $R_{UV}(\tau) = R_{VU}(-\tau)$, the condition $R_{UV} = -R_{VU}$ means that $R_{UV}$ is an odd function: $R_{UV}(\tau) = -R_{UV}(-\tau)$.

We summarize the results as a proposition. 

Proposition 8.5.1 Suppose $X$ is given by (8.34) or (8.35), where $U$ and $V$ are jointly WSS. Then $X$ is WSS if and only if $U$ and $V$ are mean zero with $R_U = R_V$ and $R_{UV} = -R_{VU}$. Equivalently, $X$ is WSS if and only if $Z = U + jV$ is mean zero and $E[Z_aZ_b] = 0$ for all $a, b$. If $X$ is WSS then

$$R_X(\tau) = R_U(\tau)\cos(\omega_c\tau) + R_{UV}(\tau)\sin(\omega_c\tau)$$

$$S_X(\omega) = \frac{1}{2}\left[S_U(\omega - \omega_c) + S_U(\omega + \omega_c) - jS_{UV}(\omega - \omega_c) + jS_{UV}(\omega + \omega_c)\right]$$

and, with $R_Z(\tau)$ defined by $R_Z(a - b) = E[Z_aZ_b^*]$,

$$R_X(\tau) = \frac{1}{2}\mathrm{Re}\left(R_Z(\tau)e^{j\omega_c\tau}\right)$$

The functions $S_X$, $S_U$, and $S_V$ are nonnegative, even functions, and $S_{UV}$ is a purely imaginary odd function (i.e. $S_{UV}(\omega) = j\,\mathrm{Im}(S_{UV}(\omega)) = -S_{UV}(-\omega)$.)

Let $X$ be any WSS real-valued random process with a spectral density $S_X$, and continue to let $\omega_c > \omega_o > 0$. Then $X$ is defined to be a narrowband random process if $S_X(\omega) = 0$ whenever $||\omega| - \omega_c| \ge \omega_o$. Equivalently, $X$ is a narrowband random process if $R_X(t)$ is a narrowband function. We've seen how such a process can be obtained by modulating a pair of jointly WSS baseband random processes $U$ and $V$. We show next that all narrowband random processes have such a representation.

To proceed as in the case of deterministic signals, we first wish to define the Hilbert transform of $X$, denoted by $\check{X}$. A slight concern about defining $\check{X}$ is that the function $-j\,\mathrm{sgn}(\omega)$ does not have finite energy. However, we can replace this function by the function given by

$$H(\omega) = -j\,\mathrm{sgn}(\omega)I_{|\omega| \le \omega_o + \omega_c},$$

which has finite energy and it has a real-valued inverse transform $h$. Define $\check{X}$ as the output when $X$ is passed through the linear system with impulse response $h$. Since $X$ and $h$ are real valued, the random process $\check{X}$ is also real valued. As in the deterministic case, define random processes $Z$, $U$, and $V$ by $Z_t = (X_t + j\check{X}_t)e^{-j\omega_c t}$, $U_t = \mathrm{Re}(Z_t)$, and $V_t = \mathrm{Im}(Z_t)$.

Proposition 8.5.2 Let $X$ be a narrowband WSS random process, with spectral density $S_X$ satisfying $S_X(\omega) = 0$ unless $\omega_c - \omega_o < |\omega| < \omega_c + \omega_o$, where $\omega_o < \omega_c$. Then $\mu_X = 0$ and the following representations hold

$$X_t = \mathrm{Re}\left(Z_te^{j\omega_c t}\right) = U_t\cos(\omega_c t) - V_t\sin(\omega_c t)$$

where $Z_t = U_t + jV_t$, and $U$ and $V$ are jointly WSS real-valued random processes with mean zero and

$$S_U(\omega) = S_V(\omega) = \left[S_X(\omega - \omega_c) + S_X(\omega + \omega_c)\right]I_{|\omega| < \omega_o} \qquad (8.36)$$

and

$$S_{UV}(\omega) = j\left[S_X(\omega + \omega_c) - S_X(\omega - \omega_c)\right]I_{|\omega| < \omega_o} \qquad (8.37)$$

Equivalently,

$$R_U(\tau) = R_V(\tau) = R_X(\tau)\cos(\omega_c\tau) + \check{R}_X(\tau)\sin(\omega_c\tau) \qquad (8.38)$$

and

$$R_{UV}(\tau) = R_X(\tau)\sin(\omega_c\tau) - \check{R}_X(\tau)\cos(\omega_c\tau) \qquad (8.39)$$
Proof. To show that $\mu_X = 0$, consider passing $X$ through a linear, time-invariant system with transfer function $K(\omega) = 1$ if $\omega$ is in either the upper band or lower band, and $K(\omega) = 0$ otherwise. Let $Y$ denote the output and $h$ the impulse response of this system. Then $\mu_Y = \mu_X\int_{-\infty}^{\infty} h(\tau)\,d\tau = \mu_XK(0) = 0$. Since $K(\omega) = 1$ for all $\omega$ such that $S_X(\omega) > 0$, it follows that $R_X = R_Y = R_{XY} = R_{YX}$. Therefore $E[|X_t - Y_t|^2] = 0$ so that $X_t$ has the same mean as $Y_t$, namely zero, as claimed.

By the definitions of the processes $Z$, $U$, and $V$, using the notation $c_t = \cos(\omega_c t)$ and $s_t = \sin(\omega_c t)$, we have

$$U_t = X_tc_t + \check{X}_ts_t \qquad V_t = -X_ts_t + \check{X}_tc_t$$

The remainder of the proof consists of computing $R_U$, $R_V$, and $R_{UV}$ as functions of two variables, because it is not yet clear that $U$ and $V$ are jointly WSS.

By the fact $X$ is WSS and the definition of $\check{X}$, the processes $X$ and $\check{X}$ are jointly WSS, and the various spectral densities are given by

$$S_{\check{X}} = |H|^2S_X = S_X \qquad S_{\check{X}X} = HS_X \qquad S_{X\check{X}} = H^*S_X = -HS_X$$

Therefore,

$$R_{\check{X}} = R_X \qquad R_{\check{X}X} = \check{R}_X \qquad R_{X\check{X}} = -\check{R}_X$$

Thus, for real numbers $a$ and $b$,

$$\begin{aligned}
R_U(a,b) &= E\left[\left(X(a)c_a + \check{X}(a)s_a\right)\left(X(b)c_b + \check{X}(b)s_b\right)\right] \\
&= R_X(a-b)(c_ac_b + s_as_b) + \check{R}_X(a-b)(s_ac_b - c_as_b) \\
&= R_X(a-b)c_{a-b} + \check{R}_X(a-b)s_{a-b}
\end{aligned}$$

Thus, $R_U(a, b)$ is a function of $a - b$, and $R_U(\tau)$ is given by the right side of (8.38). The proof that $R_V$ also satisfies (8.38), and the proof of (8.39) are similar. Finally, it is a simple matter to derive (8.36) and (8.37) from (8.38) and (8.39), respectively. □

Equations (8.36) and (8.37) have simple graphical interpretations, as illustrated in Figure 8.10. Equation (8.36) means that $S_U$ and $S_V$ are each equal to the sum of the upper lobe of $S_X$ shifted to the left by $\omega_c$ and the lower lobe of $S_X$ shifted to the right by $\omega_c$. Similarly, equation (8.37) means that $S_{UV}$ is equal to the sum of $j$ times the upper lobe of $S_X$ shifted to the left by $\omega_c$ and $-j$ times the lower lobe of $S_X$ shifted to the right by $\omega_c$. Equivalently, $S_U$ and $S_V$ are each twice the symmetric part of the upper lobe of $S_X$, and $S_{UV}$ is $2j$ times the antisymmetric part of the upper lobe of $S_X$ (taking the symmetric and antisymmetric parts of a function $f$ to be $(f(\omega) \pm f(-\omega))/2$, with the lobe recentered at zero). Since $R_{UV}$ is an odd function of $\tau$, it follows that $R_{UV}(0) = 0$. Thus, for any fixed time $t$, $U_t$ and $V_t$ are uncorrelated. That does not imply that $U_s$ and $V_t$ are uncorrelated for all $s$ and $t$, for the cross correlation function $R_{UV}$ is identically zero if and only if the upper lobe of $S_X$ is symmetric about $\omega_c$.
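The graphical recipe in (8.36) and (8.37) translates directly into code. The following is a minimal sketch with hypothetical names and parameter values ($\omega_c = 10$, $\omega_o = 2$). It builds $S_U$ and $S_{UV}$ from a given narrowband density $S_X$ and tries two cases: a flat lobe, which is symmetric about $\omega_c$ so that $S_{UV}$ vanishes, and a ramp-shaped lobe, which is not symmetric so that $S_{UV}$ is a nonzero, purely imaginary, odd function.

```python
def make_baseband_psds(S_X, w_c, w_o):
    """Return (S_U, S_UV) built from S_X according to (8.36) and (8.37)."""

    def S_U(w):
        return S_X(w - w_c) + S_X(w + w_c) if abs(w) < w_o else 0.0

    def S_UV(w):
        return 1j * (S_X(w + w_c) - S_X(w - w_c)) if abs(w) < w_o else 0j

    return S_U, S_UV


def flat_psd(w, w_c=10.0, w_o=2.0):
    """Hypothetical narrowband PSD with flat lobes (symmetric about w_c)."""
    return 1.0 if abs(abs(w) - w_c) < w_o else 0.0


def ramp_psd(w, w_c=10.0, w_o=2.0):
    """Hypothetical even narrowband PSD whose upper lobe rises linearly from 0 to 1."""
    d = abs(w) - w_c
    return (d + w_o) / (2.0 * w_o) if abs(d) < w_o else 0.0


if __name__ == "__main__":
    for name, psd in (("flat", flat_psd), ("ramp", ramp_psd)):
        S_U, S_UV = make_baseband_psds(psd, 10.0, 2.0)
        for w in (-1.0, -0.5, 0.0, 0.5, 1.0):
            print(name, w, S_U(w), S_UV(w))
```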

Figure 8.10: A narrowband power spectral density and associated baseband spectral densities.

Example 8.5.3 (Baseband equivalent filtering of a random process) As noted above, filtering of narrowband deterministic signals can be described using equivalent baseband signals, namely the complex envelopes. The same is true for filtering of narrowband random processes. Suppose $X$ is a narrowband WSS random process, suppose $g$ is a finite energy narrowband signal, and suppose $Y$ is the output process when $X$ is filtered using impulse response function $g$. Then $Y$ is also a WSS narrowband random process. Let $Z$ denote the complex envelope of $X$, given in Proposition 8.5.2, and let $z_g$ denote the complex envelope signal of $g$, meaning that $z_g$ is the complex baseband signal such that $g(t) = \mathrm{Re}(z_g(t)e^{j\omega_c t})$. It can be shown that the complex envelope process of $Y$ is $\frac{1}{2}z_g * Z$ (footnote 1). Thus, the filtering of $X$ by $g$ is equivalent to the filtering of $Z$ by $\frac{1}{2}z_g$.



Example 8.5.4 (Simulation of a narrowband random process) Let $\omega_o$ and $\omega_c$ be positive numbers with $0 < \omega_o < \omega_c$. Suppose $S_X$ is a nonnegative function which is even (i.e. $S_X(\omega) = S_X(-\omega)$ for all $\omega$) with $S_X(\omega) = 0$ if $||\omega| - \omega_c| \ge \omega_o$. We discuss briefly the problem of writing a computer simulation to generate a real-valued WSS random process $X$ with power spectral density $S_X$.

By Proposition 8.5.1, it suffices to simulate baseband random processes $U$ and $V$ with the power spectral densities specified by (8.36) and cross power spectral density specified by (8.37). For increased tractability, we impose an additional assumption on $S_X$, namely that the upper lobe of $S_X$ is symmetric about $\omega_c$. This assumption is equivalent to the assumption that $S_{UV}$ vanishes, and therefore that the processes $U$ and $V$ are uncorrelated with each other. Thus, the processes $U$ and $V$ can be generated independently.

In turn, the processes $U$ and $V$ can be simulated by first generating sequences of random variables $U_{nT}$ and $V_{nT}$ for sampling frequency $\frac{1}{T} = 2f_o = \frac{\omega_o}{\pi}$. A discrete time random process with power spectral density $S_U$ can be generated by passing a discrete-time white noise sequence with unit variance through a discrete-time linear time-invariant system with real-valued impulse response function such that the transfer function $H$ satisfies $S_U = |H|^2$. For example, taking $H(\omega) = \sqrt{S_U(\omega)}$ works, though it might not be the most well behaved linear system. (The problem of finding a transfer function $H$ with additional properties such that $S_U = |H|^2$ is called the problem of spectral factorization, which we shall return to in the next chapter.) The samples $V_{kT}$ can be generated similarly.

1 (Footnote to Example 8.5.3.) An elegant proof of this fact is based on spectral representation theory for WSS random processes, covered for example in Doob, Stochastic Processes, Wiley, 1953. The basic idea is to define the Fourier transform of a WSS random process, which, like white noise, is a generalized random process. Then essentially the same method we described for filtering of deterministic narrowband signals works.

For a specific example, suppose that (using kHz for kilohertz, or thousands of Hertz)

\[ S_X(2\pi f) = \begin{cases} 1 & 9{,}000 \text{ kHz} < |f| < 9{,}020 \text{ kHz} \\ 0 & \text{else.} \end{cases} \tag{8.40} \]

Notice that the parameters $\omega_o$ and $\omega_c$ are not uniquely determined by $S_X$. They must simply be
positive numbers with $\omega_o < \omega_c$ such that

\[ (9{,}000 \text{ kHz},\ 9{,}020 \text{ kHz}) \subset (f_c - f_o,\ f_c + f_o). \]

However, only the choice $f_c = 9{,}010$ kHz makes the upper lobe of $S_X$ symmetric around $f_c$.
Therefore we take $f_c = 9{,}010$ kHz. We take the minimum allowable value for $f_o$, namely $f_o =
10$ kHz. For this choice, (8.36) yields

\[ S_U(2\pi f) = S_V(2\pi f) = \begin{cases} 2 & |f| < 10 \text{ kHz} \\ 0 & \text{else} \end{cases} \tag{8.41} \]

and (8.37) yields $S_{UV}(2\pi f) = 0$ for all $f$. The processes $U$ and $V$ are continuous-time baseband
random processes with one-sided bandwidth limit 10 kHz. To simulate these processes it is therefore
enough to generate samples of them with sampling period $T = 0.5 \times 10^{-4}$, and then use the Nyquist
sampling representation described in Section 8.4. The processes of samples will, according to (8.26),
have power spectral density equal to $4 \times 10^4$ over the interval $[-\pi, \pi]$. Consequently, the samples
can be taken to be uncorrelated with $E[|A_k|^2] = E[|B_k|^2] = 4 \times 10^4$. For example, these variables
can be taken to be independent real Gaussian random variables. Putting the steps together, we
find the following representation for $X$:

\[ X_t = \cos(\omega_c t) \left( \sum_{n=-\infty}^{\infty} A_n\, \mathrm{sinc}\!\left(\frac{t - nT}{T}\right) \right) - \sin(\omega_c t) \left( \sum_{n=-\infty}^{\infty} B_n\, \mathrm{sinc}\!\left(\frac{t - nT}{T}\right) \right) \]
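The representation of $X$ as two sinc series multiplied by $\cos(\omega_c t)$ and $\sin(\omega_c t)$ can be sketched in Python as follows. This is an illustration under stated assumptions: the infinite sums are truncated to $|n| \leq$ `half_width` (a hypothetical parameter), frequencies are in Hz and times in seconds, and `np.sinc` is the normalized sinc $\sin(\pi u)/(\pi u)$.

```python
import numpy as np

def simulate_narrowband(t, fc=9.010e6, fo=10.0e3, half_width=200, seed=0):
    """Approximate X_t = U_t cos(2 pi fc t) - V_t sin(2 pi fc t) on the time grid t."""
    rng = np.random.default_rng(seed)
    T = 1.0 / (2.0 * fo)                        # sampling period, 0.5e-4 seconds
    n = np.arange(-half_width, half_width + 1)  # truncated index range (an assumption)
    sigma = np.sqrt(4.0e4)                      # E[A_n^2] = E[B_n^2] = 4e4
    A = sigma * rng.standard_normal(n.size)     # samples of U
    B = sigma * rng.standard_normal(n.size)     # samples of V
    kernel = np.sinc((t[:, None] - n[None, :] * T) / T)
    U = kernel @ A                              # sinc interpolation of the samples
    V = kernel @ B
    X = U * np.cos(2.0 * np.pi * fc * t) - V * np.sin(2.0 * np.pi * fc * t)
    return X, U, V, A, B

T = 0.5e-4
t_grid = np.arange(-5, 6) * T                   # the sampling instants kT, k = -5, ..., 5
X, U, V, A, B = simulate_narrowband(t_grid)
```

At the sampling instants $t = kT$ the interpolated values reduce to the samples themselves, so `U` agrees with the corresponding entries of `A`. Times far from the center of the truncated index range are approximated less accurately.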



8.6 Complexification, Part II

A complex random variable $Z$ is said to be circularly symmetric if $Z$ has the same distribution
as $e^{j\theta}Z$ for every real value of $\theta$. If $Z$ has a pdf $f_Z$, circular symmetry of $Z$ means that $f_Z(z)$
is invariant under rotations about zero, or, equivalently, $f_Z(z)$ depends on $z$ only through $|z|$. A
collection of random variables $(Z_i : i \in I)$ is said to be jointly circularly symmetric if for every real
value of $\theta$, the collection $(Z_i : i \in I)$ has the same finite dimensional distributions as the collection
$(Z_i e^{j\theta} : i \in I)$. Note that if $(Z_i : i \in I)$ is jointly circularly symmetric, and if $(Y_j : j \in J)$ is another
collection of random variables such that each $Y_j$ is a linear combination of $Z_i$'s (with no constants
added in) then the collection $(Y_j : j \in J)$ is also jointly circularly symmetric.

Recall that a complex random vector $Z$, expressed in terms of real random vectors $U$ and $V$ as
$Z = U + jV$, has mean $EZ = EU + jEV$ and covariance matrix $\mathrm{Cov}(Z) = E[(Z - EZ)(Z - EZ)^*]$.
The pseudo-covariance matrix of $Z$ is defined by $\mathrm{Cov}^p(Z) = E[(Z - EZ)(Z - EZ)^T]$, and it differs
from the covariance of $Z$ in that a transpose, rather than a Hermitian transpose, is involved. Note
that $\mathrm{Cov}(Z)$ and $\mathrm{Cov}^p(Z)$ are readily expressed in terms of $\mathrm{Cov}(U)$, $\mathrm{Cov}(V)$, and $\mathrm{Cov}(U, V)$ as:

\[ \mathrm{Cov}(Z) = \mathrm{Cov}(U) + \mathrm{Cov}(V) + j\,(\mathrm{Cov}(V,U) - \mathrm{Cov}(U,V)) \]
\[ \mathrm{Cov}^p(Z) = \mathrm{Cov}(U) - \mathrm{Cov}(V) + j\,(\mathrm{Cov}(V,U) + \mathrm{Cov}(U,V)) \]

where $\mathrm{Cov}(V, U) = \mathrm{Cov}(U, V)^T$. Conversely,

\[ \mathrm{Cov}(U) = \mathrm{Re}\,(\mathrm{Cov}(Z) + \mathrm{Cov}^p(Z))/2, \qquad \mathrm{Cov}(V) = \mathrm{Re}\,(\mathrm{Cov}(Z) - \mathrm{Cov}^p(Z))/2, \]

and

\[ \mathrm{Cov}(U, V) = \mathrm{Im}\,(-\mathrm{Cov}(Z) + \mathrm{Cov}^p(Z))/2. \]
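These relations are algebraic identities, so they also hold exactly for sample covariances. The following Python sketch checks them numerically; the helper `cross_cov`, the dimension, and the mixing matrix are illustrative assumptions, not part of the notes.

```python
import numpy as np

def cross_cov(X, Y, hermitian=True):
    """Sample version of E[(X - EX)(Y - EY)^*], or of E[(X - EX)(Y - EY)^T]
    if hermitian=False. Each row of X and Y is one sample of the vector."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    if hermitian:
        Yc = Yc.conj()
    return Xc.T @ Yc / X.shape[0]

rng = np.random.default_rng(1)
n, num_samples = 3, 2000
mixing = rng.standard_normal((2 * n, 2 * n))        # arbitrary mixing, to correlate U and V
real_parts = rng.standard_normal((num_samples, 2 * n)) @ mixing.T
U, V = real_parts[:, :n], real_parts[:, n:]
Z = U + 1j * V

cov_Z = cross_cov(Z, Z)                             # Cov(Z)
pcov_Z = cross_cov(Z, Z, hermitian=False)           # pseudo-covariance Cov^p(Z)
cov_U, cov_V, cov_UV = cross_cov(U, U), cross_cov(V, V), cross_cov(U, V)

# Cov(U), Cov(V), Cov(U, V) recovered from Cov(Z) and Cov^p(Z):
print(np.allclose(cov_U, (cov_Z + pcov_Z).real / 2))
print(np.allclose(cov_V, (cov_Z - pcov_Z).real / 2))
print(np.allclose(cov_UV, (-cov_Z + pcov_Z).imag / 2))
```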

The vector Z is defined to be Gaussian if the random vectors U and V are jointly Gaussian. 

Suppose that $Z$ is a complex Gaussian random vector. Then its distribution is fully determined
by its mean and the matrices $\mathrm{Cov}(U)$, $\mathrm{Cov}(V)$, and $\mathrm{Cov}(U, V)$, or equivalently by its mean and
the matrices $\mathrm{Cov}(Z)$ and $\mathrm{Cov}^p(Z)$. Therefore, for a real value of $\theta$, $Z$ and $e^{j\theta}Z$ have the same
distribution if and only if they have the same mean, covariance matrix, and pseudo-covariance
matrix. Since $E[e^{j\theta}Z] = e^{j\theta}EZ$, $\mathrm{Cov}(e^{j\theta}Z) = \mathrm{Cov}(Z)$, and $\mathrm{Cov}^p(e^{j\theta}Z) = e^{2j\theta}\mathrm{Cov}^p(Z)$, $Z$ and $e^{j\theta}Z$
have the same distribution if and only if $(e^{j\theta} - 1)EZ = 0$ and $(e^{2j\theta} - 1)\mathrm{Cov}^p(Z) = 0$. Hence, if $\theta$ is
not a multiple of $\pi$, $Z$ and $e^{j\theta}Z$ have the same distribution if and only if $EZ = 0$ and $\mathrm{Cov}^p(Z) = 0$.
Consequently, a Gaussian random vector $Z$ is circularly symmetric if and only if its mean vector
and pseudo-covariance matrix are zero.
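The effect of a rotation on the two matrices can be checked numerically. In the Python sketch below, the helper `cov_and_pcov`, the angle `theta`, and the test vector (deliberately not circularly symmetric) are assumptions chosen for illustration.

```python
import numpy as np

def cov_and_pcov(Z):
    """Sample covariance and pseudo-covariance of complex vectors (one sample per row)."""
    Zc = Z - Z.mean(axis=0)
    m = Z.shape[0]
    return Zc.T @ Zc.conj() / m, Zc.T @ Zc / m

rng = np.random.default_rng(2)
num_samples, n = 1000, 2
# Real and imaginary parts have different scales and the mean is nonzero,
# so this vector is not circularly symmetric.
Z = (1.0 + 2.0 * rng.standard_normal((num_samples, n))
     + 1j * rng.standard_normal((num_samples, n)))

theta = 0.7
cov, pcov = cov_and_pcov(Z)
cov_rot, pcov_rot = cov_and_pcov(np.exp(1j * theta) * Z)

print(np.allclose(cov_rot, cov))                            # covariance is unchanged
print(np.allclose(pcov_rot, np.exp(2j * theta) * pcov))     # pseudo-covariance picks up e^{2j theta}
```

Because the pseudo-covariance of this example is far from zero, the rotated vector has a different pseudo-covariance, and so a different distribution, from the original.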

The joint density function of a circularly symmetric complex Gaussian random vector $Z$ with $n$ complex
dimensions and covariance matrix $K$, with $\det K \neq 0$, has the particularly elegant form:

\[ f_Z(z) = \frac{\exp(-z^* K^{-1} z)}{\pi^n \det(K)}. \tag{8.42} \]

Equation (8.42) can be derived in the same way the density for Gaussian vectors with real components
is derived. Namely, (8.42) is easy to verify if $K$ is diagonal. If $K$ is not diagonal, the
Hermitian symmetric positive definite matrix $K$ can be expressed as $K = U\Lambda U^*$, where $U$ is a
unitary matrix and $\Lambda$ is a diagonal matrix with strictly positive diagonal entries. The random
vector $Y$ defined by $Y = U^*Z$ is Gaussian and circularly symmetric with covariance matrix $\Lambda$, and
since $\det(\Lambda) = \det(K)$, it has pdf $f_Y(y) = \frac{\exp(-y^* \Lambda^{-1} y)}{\pi^n \det(K)}$. Since $|\det(U)| = 1$, $f_Z(z) = f_Y(U^*z)$,
which yields (8.42).
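As a numerical sanity check, the density $\exp(-z^*K^{-1}z)/(\pi^n \det K)$ of a circularly symmetric complex Gaussian vector can be compared with the ordinary density of the real vector $(U, V)$ of dimension $2n$. With zero pseudo-covariance, the relations between $\mathrm{Cov}(Z)$ and the real covariances give $\mathrm{Cov}(U) = \mathrm{Cov}(V) = \mathrm{Re}(K)/2$ and $\mathrm{Cov}(U,V) = -\mathrm{Im}(K)/2$. The matrix `K` and the point `z` in this Python sketch are arbitrary illustrative choices.

```python
import numpy as np

def cscg_pdf(z, K):
    """exp(-z^* K^{-1} z) / (pi^n det K) for a Hermitian positive definite K."""
    n = K.shape[0]
    quad = np.real(np.conj(z) @ np.linalg.solve(K, z))
    return np.exp(-quad) / (np.pi ** n * np.real(np.linalg.det(K)))

def real_gaussian_pdf(x, Sigma):
    """Density of a mean-zero real Gaussian vector with covariance Sigma."""
    d = Sigma.shape[0]
    quad = x @ np.linalg.solve(Sigma, x)
    return np.exp(-0.5 * quad) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(Sigma))

K = np.array([[2.0, 0.5 + 0.3j],
              [0.5 - 0.3j, 1.0]])              # Hermitian, positive definite
z = np.array([0.4 - 0.2j, -0.1 + 0.7j])

# Covariance of the stacked real vector (U, V) when the pseudo-covariance is zero.
Sigma = 0.5 * np.block([[K.real, -K.imag],
                        [K.imag,  K.real]])
x = np.concatenate([z.real, z.imag])

print(cscg_pdf(z, K), real_gaussian_pdf(x, Sigma))   # the two values agree
```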

Let us switch now to random processes. Let $Z$ be a complex-valued random process and let $U$
and $V$ be the real-valued random processes such that $Z_t = U_t + jV_t$. Recall that $Z$ is Gaussian if $U$
and $V$ are jointly Gaussian, and the covariance function of $Z$ is defined by $C_Z(s, t) = \mathrm{Cov}(Z_s, Z_t)$.
The pseudo-covariance function of $Z$ is defined by $C_Z^p(s,t) = \mathrm{Cov}^p(Z_s, Z_t)$. As for covariance
matrices of vectors, both $C_Z$ and $C_Z^p$ are needed to determine $C_U$, $C_V$, and $C_{UV}$.

Following the vast majority of the literature, we define $Z$ to be wide sense stationary (WSS)
if $\mu_Z(t)$ is constant and if $C_Z(s,t)$ (or $R_Z(s,t)$) is a function of $s - t$ alone. Some authors use
a stronger definition of WSS, by defining $Z$ to be WSS if either of the following two equivalent
conditions is satisfied:

• $\mu_Z(t)$ is constant, and both $C_Z(s,t)$ and $C_Z^p(s,t)$ are functions of $s - t$

• $U$ and $V$ are jointly WSS

If Z is Gaussian then it is stationary if and only if it satisfies the stronger definition of WSS. 

A complex random process $Z = (Z_t : t \in \mathbf{T})$ is called circularly symmetric if the random
variables of the process, $(Z_t : t \in \mathbf{T})$, are jointly circularly symmetric. If $Z$ is a complex Gaussian
random process, it is circularly symmetric if and only if it has mean zero and $C_Z^p(s,t) = 0$ for
all $s, t$. Proposition 8.5.2 shows that the baseband equivalent process $Z$ for a Gaussian real-valued
narrowband WSS random process X is circularly symmetric. Nearly all complex valued random 
processes in applications arise in this fashion. For circularly symmetric complex random processes, 
the definition of WSS we adopted, and the stronger definition mentioned in the previous paragraph, 
are equivalent. A circularly symmetric complex Gaussian random process is stationary if and only 
if it is WSS. 

The interested reader can find more related to the material in this section in Neeser and Massey, 
"Proper Complex Random Processes with Applications to Information Theory," IEEE Transactions 
on Information Theory, vol. 39, no. 4, July 1993. 

8.7 Problems 

8.1 On filtering a WSS random process 

Suppose $Y$ is the output of a linear time-invariant system with WSS input $X$, impulse response
function $h$, and transfer function $H$. Indicate whether the following statements are true or false.
Justify your answers. (a) If $|H(\omega)| \leq 1$ for all $\omega$ then the power of $Y$ is less than or equal to the
power of $X$. (b) If $X$ is periodic (in addition to being WSS) then $Y$ is WSS and periodic. (c) If $X$
has mean zero and strictly positive total power, and if $\|h\|^2 > 0$, then the output power is strictly
positive.

8.2 On the cross spectral density 

Suppose $X$ and $Y$ are jointly WSS such that the power spectral densities $S_X$, $S_Y$, and $S_{XY}$ are
continuous. Show that for each $\omega$, $|S_{XY}(\omega)|^2 \leq S_X(\omega)S_Y(\omega)$. Hint: Fix $\omega_o$, let $\epsilon > 0$, and let $J^\epsilon$
denote the interval of length $\epsilon$ centered at $\omega_o$. Consider passing both $X$ and $Y$ through a linear
time-invariant system with transfer function $H^\epsilon(\omega) = I_{J^\epsilon}(\omega)$. Apply the Schwarz inequality to the
output processes sampled at a fixed time, and let $\epsilon \to 0$.



8.3 Modulating and filtering a stationary process 



Let $X = (X_t : t \in \mathbb{Z})$ be a discrete-time mean-zero stationary random process with power $E[X_0^2] =
1$. Let $Y$ be the stationary discrete-time random process obtained from $X$ by modulation as follows:

\[ Y_t = X_t \cos(80\pi t + \Theta), \]

where $\Theta$ is independent of $X$ and is uniformly distributed over $[0, 2\pi]$. Let $Z$ be the stationary
discrete-time random process obtained from $Y$ by the linear equations:

\[ Z_{t+1} = (1 - a)Z_t + aY_{t+1} \]

for all $t$, where $a$ is a constant with $0 < a < 1$. (a) Why is the random process $Y$ stationary?
(b) Express the autocorrelation function of $Y$, $R_Y(\tau) = E[Y_\tau Y_0]$, in terms of the autocorrelation
function of $X$. Similarly, express the power spectral density of $Y$, $S_Y(\omega)$, in terms of the power
spectral density of $X$, $S_X(\omega)$. (c) Find and sketch the transfer function $H(\omega)$ for the linear system
describing the mapping from $Y$ to $Z$. (d) Can the power of $Z$ be arbitrarily large (depending on
$a$)? Explain your answer. (e) Describe an input $X$ satisfying the assumptions above so that the
power of $Z$ is at least 0.5, for any value of $a$ with $0 < a < 1$.

8.4 Filtering a Gauss Markov process 

Let $X = (X_t : -\infty < t < +\infty)$ be a stationary Gauss Markov process with mean zero and
autocorrelation function $R_X(\tau) = \exp(-|\tau|)$. Define a random process $Y = (Y_t : t \in \mathbb{R})$ by the
differential equation $\dot{Y}_t = X_t - Y_t$.

(a) Find the cross correlation function $R_{XY}$. Are $X$ and $Y$ jointly stationary?

(b) Find $E[Y_5 \mid X_5 = 3]$. What is the approximate numerical value?

(c) Is Y a Gaussian random process? Justify your answer. 

(d) Is Y a Markov process? Justify your answer. 

8.5 Slight smoothing 

Suppose $Y$ is the output of the linear time-invariant system with input $X$ and impulse response
function $h$, such that $X$ is WSS with $R_X(\tau) = \exp(-|\tau|)$, and $h(\tau) = \frac{1}{2a} I_{\{|\tau| \leq a\}}$ for $a > 0$. If $a$
is small, then $h$ approximates the delta function $\delta(\tau)$, and consequently $Y_t \approx X_t$. This problem
explores the accuracy of the approximation.

(a) Find $R_{YX}(0)$, and use the power series expansion of $e^u$ to show that $R_{YX}(0) = 1 - \frac{a}{2} + o(a)$ as
$a \to 0$. Here, $o(a)$ denotes any term such that $o(a)/a \to 0$ as $a \to 0$.

(b) Find $R_Y(0)$, and use the power series expansion of $e^u$ to show that $R_Y(0) = 1 - \frac{2a}{3} + o(a)$ as
$a \to 0$.

(c) Show that $E[|X_t - Y_t|^2] = \frac{a}{3} + o(a)$ as $a \to 0$.

8.6 A stationary two-state Markov process 

Let $X = (X_k : k \in \mathbb{Z})$ be a stationary Markov process with state space $S = \{1, -1\}$ and one-step
transition probability matrix

\[ P = \begin{pmatrix} 1-p & p \\ p & 1-p \end{pmatrix}, \]

where $0 < p < 1$. Find the mean, correlation function and power spectral density function of $X$.
Hint: For nonnegative integers $k$:

\[ P^k = \begin{pmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & \frac{1}{2} \end{pmatrix} + (1-2p)^k \begin{pmatrix} \frac{1}{2} & -\frac{1}{2} \\ -\frac{1}{2} & \frac{1}{2} \end{pmatrix}. \]

8.7 A stationary two-state Markov process in continuous time 

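Let $X = (X_t : t \in \mathbb{R})$ be a stationary Markov process with state space $S = \{1, -1\}$ and $Q$ matrix

\[ Q = \begin{pmatrix} -\alpha & \alpha \\ \alpha & -\alpha \end{pmatrix}, \]

where $\alpha > 0$. Find the mean, correlation function and power spectral density function of $X$. (Hint:
Recall from the example in the chapter on Markov processes that for $s < t$, the matrix of transition
probabilities $p_{ij}(s,t)$ is given by $H(\tau)$, where $\tau = t - s$ and

\[ H(\tau) = \begin{pmatrix} \frac{1 + e^{-2\alpha\tau}}{2} & \frac{1 - e^{-2\alpha\tau}}{2} \\ \frac{1 - e^{-2\alpha\tau}}{2} & \frac{1 + e^{-2\alpha\tau}}{2} \end{pmatrix}.) \]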

8.8 A linear estimation problem 

Suppose X and Y are possibly complex valued jointly WSS processes with known autocorrelation 
functions, cross-correlation function, and associated spectral densities. Suppose Y is passed through 
a linear time-invariant system with impulse response function h and transfer function H, and let 
$Z$ be the output. The mean square error of estimating $X_t$ by $Z_t$ is $E[|X_t - Z_t|^2]$.

(a) Express the mean square error in terms of $R_X$, $R_Y$, $R_{XY}$ and $h$.

(b) Express the mean square error in terms of $S_X$, $S_Y$, $S_{XY}$ and $H$.

(c) Using your answer to part (b), find the choice of $H$ that minimizes the mean square error. (Hint:
Try working out the problem first assuming the processes are real valued. For the complex case,
note that for $\sigma^2 > 0$ and complex numbers $z$ and $z_o$, $\sigma^2|z|^2 - 2\mathrm{Re}(z^* z_o)$ is equal to $|\sigma z - \frac{z_o}{\sigma}|^2 - \frac{|z_o|^2}{\sigma^2}$,
which is minimized with respect to $z$ by $z = \frac{z_o}{\sigma^2}$.)

8.9 Linear time invariant, uncorrelated scattering channel 

A signal transmitted through a scattering environment can propagate over many different paths 
on its way to a receiver. The channel gains along distinct paths are often modeled as uncorrelated. 
The paths may differ in length, causing a delay spread. Let $h = (h_u : u \in \mathbb{Z})$ consist of uncorrelated,
possibly complex valued random variables with mean zero and $E[|h_u|^2] = g_u$. Assume that $G =
\sum_u g_u < \infty$. The variable $h_u$ is the random complex gain for delay $u$, and $g = (g_u : u \in \mathbb{Z})$ is
the energy gain delay mass function with total gain $G$. Given a deterministic signal $x$, the channel
output is the random signal $Y$ defined by $Y_i = \sum_{u=-\infty}^{\infty} h_u x_{i-u}$.

(a) Determine the mean and autocorrelation function for $Y$ in terms of $x$ and $g$.

(b) Express the average total energy of $Y$: $E[\sum_i |Y_i|^2]$, in terms of $x$ and $g$.

(c) Suppose instead that the input is a WSS random process $X$ with autocorrelation function $R_X$.
The input $X$ is assumed to be independent of the channel $h$. Express the mean and autocorrelation
function of the output $Y$ in terms of $R_X$ and $g$. Is $Y$ WSS?

(d) Since the impulse response function $h$ is random, so is its Fourier transform, $H = (H(\omega) : -\pi \leq
\omega \leq \pi)$. Express the autocorrelation function of the random process $H$ in terms of $g$.

8.10 The accuracy of approximate differentiation 

Let $X$ be a WSS baseband random process with power spectral density $S_X$, and let $\omega_o$ be the
one-sided band limit of $X$. The process $X$ is m.s. differentiable and $X'$ can be viewed as the
output of a time-invariant linear system with transfer function $H(\omega) = j\omega$.

(a) What is the power spectral density of $X'$?

(b) Let $Y_t = \frac{X_{t+a} - X_{t-a}}{2a}$, for some $a > 0$. We can also view $Y = (Y_t : t \in \mathbb{R})$ as the output of
a time-invariant linear system, with input $X$. Find the impulse response function $k$ and transfer
function $K$ of the linear system. Show that $K(\omega) \to j\omega$ as $a \to 0$.

(c) Let $D_t = X'_t - Y_t$. Find the power spectral density of $D$.

(d) Find a value of $a$, depending only on $\omega_o$, so that $E[|D_t|^2] \leq (0.01)E[|X'_t|^2]$. In other words, for
such $a$, the m.s. error of approximating $X'_t$ by $Y_t$ is less than one percent of $E[|X'_t|^2]$. You can use
the fact that $0 \leq 1 - \frac{\sin(u)}{u} \leq \frac{u^2}{6}$ for all real $u$. (Hint: Find $a$ so that $S_D(\omega) \leq (0.01)S_{X'}(\omega)$ for
$|\omega| \leq \omega_o$.)

8.11 Some linear transformations of some random processes 

Let $U = (U_n : n \in \mathbb{Z})$ be a random process such that the variables $U_n$ are independent, identically
distributed, with $E[U_n] = \mu$ and $\mathrm{Var}(U_n) = \sigma^2$, where $\mu \neq 0$ and $\sigma^2 > 0$. Please keep in mind that
$\mu \neq 0$. Let $X = (X_n : n \in \mathbb{Z})$ be defined by $X_n = \sum_{k=0}^{\infty} U_{n-k} a^k$, for a constant $a$ with $0 < a < 1$.

(a) Is $X$ stationary? Find the mean function $\mu_X$ and autocovariance function $C_X$ for $X$.

(b) Is $X$ a Markov process? (Hint: $X$ is not necessarily Gaussian. Does $X$ have a state representation
driven by $U$?)

(c) Is $X$ mean ergodic in the m.s. sense?

Let $U$ be as before, and let $Y = (Y_n : n \in \mathbb{Z})$ be defined by $Y_n = \sum_{k=0}^{\infty} U_{n-k} A^k$, where $A$ is a
random variable distributed on the interval $(0, 0.5)$ (the exact distribution is not specified), and $A$
is independent of the random process $U$.

(d) Is $Y$ stationary? Find the mean function $\mu_Y$ and autocovariance function $C_Y$ for $Y$. (Your
answer may include expectations involving $A$.)

(e) Is Y a Markov process? (Give a brief explanation.) 

(f) Is Y mean ergodic in the m.s. sense? 

8.12 Filtering Poisson white noise 

A Poisson random process $N = (N_t : t \geq 0)$ has independent increments. The derivative of $N$,
written $N'$, does not exist as an ordinary random process, but it does exist as a generalized random
process. Graphically, picture $N'$ as a superposition of delta functions, one at each arrival time of the
Poisson process. As a generalized random process, $N'$ is stationary with mean and autocovariance
functions given by $E[N'_t] = \lambda$, and $C_{N'}(s,t) = \lambda\delta(s - t)$, respectively, because, when integrated,
these functions give the correct values for the mean and covariance of $N$: $E[N_t] = \int_0^t \lambda\,ds$ and
$C_N(s,t) = \int_0^s \int_0^t \lambda\delta(u - v)\,dv\,du$. The random process $N'$ can be extended to be defined for negative
times by augmenting the original random process $N$ by another rate $\lambda$ Poisson process for negative
times. Then $N'$ can be viewed as a stationary random process, and its integral over intervals gives
rise to a process $N(a, b]$ as described in Problem 4.17. (The process $N' - \lambda$ is a white noise process,
in that it is a generalized random process which is stationary, mean zero, and has autocorrelation
function $\lambda\delta(\tau)$. Both $N'$ and $N' - \lambda$ are called Poisson shot noise processes. One application for
such processes is modeling noise in small electronic devices, in which effects of single electrons can
be registered. For the remainder of this problem, $N'$ is used instead of the mean zero version.) Let
$X$ be the output when $N'$ is passed through a linear time-invariant filter with an impulse response
function $h$, such that $\int_{-\infty}^{\infty} |h(t)|\,dt$ is finite. (Remark: In the special case that $h(t) = I_{\{0 \leq t < 1\}}$, $X$ is
the M/D/$\infty$ process of Problem 4.17.)

(a) Find the mean and covariance functions of $X$.

(b) Consider the special case that $h(t) = e^{-t} I_{\{t \geq 0\}}$. Explain why $X$ is a Markov process in this
case. (Hint: What is the behavior of $X$ between the arrival times of the Poisson process? What
does $X$ do at the arrival times?)

8.13 A linear system with a feedback loop 

The system with input $X$ and output $Y$ involves feedback with the loop transfer function shown.

[Figure: block diagram of the feedback system with input $X$, output $Y$, and loop transfer function $\frac{\alpha}{1 + j\omega}$.]

(a) Find the transfer function $K$ of the system describing the mapping from $X$ to $Y$.

(b) Find the corresponding impulse response function.

(c) The power of $Y$ divided by the power of $X$ depends on the power spectral density, $S_X$. Find
the supremum of this ratio, over all choices of $S_X$, and describe what choice of $S_X$ achieves this
supremum.

8.14 Linear and nonlinear reconstruction from samples 

Suppose $X_t = \sum_{n=-\infty}^{\infty} g(t - n - U)B_n$, where the $B_n$'s are independent with mean zero and variance
$\sigma^2 > 0$, $g$ is a function with finite energy $\int |g(t)|^2\,dt$ and Fourier transform $G(\omega)$, $U$ is a random
variable which is independent of $B$ and uniformly distributed on the interval $[0, 1]$. The process $X$
is a typical model for a digital baseband signal, where the $B_n$'s are random data symbols.

(a) Show that $X$ is WSS, with mean zero and $R_X(t) = \sigma^2\, g * \tilde{g}(t)$.

(b) Under what conditions on $G$ and $T$ can the sampling theorem be used to recover $X$ from its
samples of the form $(X(nT) : n \in \mathbb{Z})$?

(c) Consider the particular case $g(t) = (1 - |t|)_+$ and $T = 0.5$. Although this falls outside the
conditions found in part (b), show that by using nonlinear operations, the process $X$ can be
recovered from its samples of the form $(X(nT) : n \in \mathbb{Z})$. (Hint: Consider a sample path of $X$.)

8.15 Sampling a cubed Gaussian process 

Let $X = (X_t : t \in \mathbb{R})$ be a baseband mean zero stationary real Gaussian random process with
one-sided band limit $f_o$ Hz. Thus, $X_t = \sum_{n=-\infty}^{\infty} X_{nT}\, \mathrm{sinc}\!\left(\frac{t - nT}{T}\right)$ where $\frac{1}{T} = 2f_o$. Let $Y_t = X_t^3$ for
each $t$.

(a) Is $Y$ stationary? Express $R_Y$ in terms of $R_X$, and $S_Y$ in terms of $S_X$ and/or $R_X$. (Hint: If
$A$, $B$ are jointly Gaussian and mean zero, $\mathrm{Cov}(A^3, B^3) = 6\,\mathrm{Cov}(A, B)^3 + 9E[A^2]E[B^2]\mathrm{Cov}(A, B)$.)

(b) At what rate $\frac{1}{T'}$ should $Y$ be sampled in order that $Y_t = \sum_{n=-\infty}^{\infty} Y_{nT'}\, \mathrm{sinc}\!\left(\frac{t - nT'}{T'}\right)$?

(c) Can Y be recovered with fewer samples than in part (b)? Explain. 

8.16 An approximation of white noise 

White noise in continuous time can be approximated by a piecewise constant process as follows. 
Let $T$ be a small positive constant, let $A_T$ be a positive scaling constant depending on $T$, and let
$(B_k : k \in \mathbb{Z})$ be a discrete-time white noise process with $R_B(k) = \sigma^2 I_{\{k=0\}}$. Define $(N_t : t \in \mathbb{R})$ by
$N_t = A_T B_k$ for $t \in [kT, (k+1)T)$.

(a) Sketch a typical sample path of $N$ and express $E[|\int_0^1 N_s\,ds|^2]$ in terms of $A_T$, $T$ and $\sigma^2$. For
simplicity assume that $T = \frac{1}{K}$ for some large integer $K$.

(b) What choice of $A_T$ makes the expectation found in part (a) equal to $\sigma^2$? This choice makes
$N$ a good approximation to a continuous-time white noise process with autocorrelation function
$\sigma^2\delta(\tau)$.

(c) What happens to the expectation found in part (a) as $T \to 0$ if $A_T = 1$ for all $T$?

8.17 Simulating a baseband random process 

Suppose a real-valued Gaussian baseband process $X = (X_t : t \in \mathbb{R})$ with mean zero and power
spectral density

\[ S_X(2\pi f) = \begin{cases} 1 & \text{if } |f| \leq 0.5 \\ 0 & \text{else} \end{cases} \]

is to be simulated over the time interval $[-500, 500]$ through use of the sampling theorem with
sampling time $T = 1$. (a) What is the joint distribution of the samples, $X_n : n \in \mathbb{Z}$? (b) Of course
a computer cannot generate infinitely many random variables in a finite amount of time. Therefore,
consider approximating $X$ by $X^{(N)}$ defined by

\[ X_t^{(N)} = \sum_{n=-N}^{N} X_n\, \mathrm{sinc}(t - n). \]

Find a condition on $N$ to guarantee that $E[(X_t - X_t^{(N)})^2] \leq 0.01$ for $t \in [-500, 500]$. (Hint: Use
$|\mathrm{sinc}(\tau)| \leq \frac{1}{\pi|\tau|}$ and bound the series by an integral. Your choice of $N$ should not depend on $t$
because the same $N$ should work for all $t$ in the interval $[-500, 500]$.)

8.18 Filtering to maximize signal to noise ratio 

Let $X$ and $N$ be continuous time, mean zero WSS random processes. Suppose that $X$ has power
spectral density $S_X(\omega) = |\omega| I_{\{|\omega| \leq \omega_o\}}$, and that $N$ has power spectral density $S_N(\omega) = \sigma^2$ for all
$\omega$. Suppose also that $X$ and $N$ are uncorrelated with each other. Think of $X$ as a signal, and $N$
as noise. Suppose the signal plus noise $X + N$ is passed through a linear time-invariant filter with
transfer function $H$, which you are to specify. Let $\tilde{X}$ denote the output signal and $\tilde{N}$ denote the
output noise. What choice of $H$, subject to the constraints (i) $|H(\omega)| \leq 1$ for all $\omega$, and (ii) (power of
$\tilde{X}$) $\geq$ (power of $X$)/2, minimizes the power of $\tilde{N}$?

8.19 Finding the envelope of a deterministic signal

(a) Find the complex envelope $z(t)$ and real envelope $|z(t)|$ of $x(t) = \cos(2\pi(1000)t) + \cos(2\pi(1001)t)$,
using the carrier frequency $f_c = 1000.5$ Hz. Simplify your answer as much as possible.

(b) Repeat part (a), using $f_c = 995$ Hz. (Hint: The real envelope should be the same as found in
part (a).)

(c) Explain why, in general, the real envelope of a narrowband signal does not depend on which
frequency $f_c$ is used to represent the signal (as long as $f_c$ is chosen so that the upper band of the
signal is contained in an interval $[f_c - a, f_c + a]$ with $a \ll f_c$).

8.20 Sampling a signal or process that is not band limited 

(a) Fix $T > 0$ and let $\omega_o = \pi/T$. Given a finite energy signal $x$, let $x^o$ be the band-limited signal
with Fourier transform $\hat{x}^o(\omega) = I_{\{|\omega| \leq \omega_o\}} \sum_{n=-\infty}^{\infty} \hat{x}(\omega + 2n\omega_o)$. Show that $x(nT) = x^o(nT)$ for all
integers $n$. (b) Explain why $x^o(t) = \sum_{n=-\infty}^{\infty} x(nT)\, \mathrm{sinc}\!\left(\frac{t - nT}{T}\right)$.

(c) Let $X$ be a mean zero WSS random process, and let $R_X^o$ be the autocorrelation function for
power spectral density $S_X^o(\omega)$ defined by $S_X^o(\omega) = I_{\{|\omega| \leq \omega_o\}} \sum_{n=-\infty}^{\infty} S_X(\omega + 2n\omega_o)$. Show that
$R_X(nT) = R_X^o(nT)$ for all integers $n$. (d) Explain why the random process $Y$ defined by $Y_t =
\sum_{n=-\infty}^{\infty} X_{nT}\, \mathrm{sinc}\!\left(\frac{t - nT}{T}\right)$ is WSS with autocorrelation function $R_X^o$. (e) Find $S_X^o$ in case $S_X(\omega) =
\exp(-\alpha|\omega|)$ for $\omega \in \mathbb{R}$.

8.21 A narrowband Gaussian process 

Let $X$ be a real-valued stationary Gaussian process with mean zero and $R_X(\tau) = \cos(2\pi(30\tau))(\mathrm{sinc}(6\tau))$.
(a) Find and carefully sketch the power spectral density of $X$. (b) Sketch a sample path of $X$. (c)
The process $X$ can be represented by $X_t = \mathrm{Re}(Z_t e^{2\pi j 30 t})$, where $Z_t = U_t + jV_t$ for jointly stationary
narrowband real-valued random processes $U$ and $V$. Find the spectral densities $S_U$, $S_V$, and $S_{UV}$.

(d) Find $P\{|Z_{33}| > 5\}$. Note that $|Z_t|$ is the real envelope process of $X$.

8.22 Another narrowband Gaussian process 

Suppose a real-valued Gaussian random process $R = (R_t : t \in \mathbb{R})$ with mean 2 and power spectral
density $S_R(2\pi f) = e^{-|f|/10^4}$ is fed through a linear time-invariant system with transfer function

\[ H(2\pi f) = \begin{cases} 0.1 & 5000 \leq |f| \leq 6000 \\ 0 & \text{else.} \end{cases} \]

(a) Find the mean and power spectral density of the output process $X = (X_t : t \in \mathbb{R})$. (b) Find
$P\{X_{25} > 6\}$. (c) The random process $X$ is a narrowband random process. Find the power spectral
densities $S_U$, $S_V$ and the cross spectral density $S_{UV}$ of jointly WSS baseband random processes $U$
and $V$ so that

\[ X_t = U_t \cos(2\pi f_c t) - V_t \sin(2\pi f_c t), \]

using $f_c = 5500$. (d) Repeat part (c) with $f_c = 5000$.

8.23 Another narrowband Gaussian process (version 2) 

Suppose a real-valued Gaussian white noise process $N$ (we assume white noise has mean zero) with
power spectral density $S_N(2\pi f) = \frac{N_o}{2}$ for $f \in \mathbb{R}$ is fed through a linear time-invariant system with
transfer function $H$ specified as follows, where $f$ represents the frequency in gigahertz (GHz) and
a gigahertz is $10^9$ cycles per second.

\[ H(2\pi f) = \begin{cases} 1 & 19.10 \leq |f| \leq 19.11 \\ \sqrt{\frac{19.12 - |f|}{0.01}} & 19.11 \leq |f| \leq 19.12 \\ 0 & \text{else} \end{cases} \]

(a) Find the mean and power spectral density of the output process $X = (X_t : t \in \mathbb{R})$.

(b) Express $P\{X_{25} > 2\}$ in terms of $N_o$ and the standard normal complementary CDF function $Q$.

(c) The random process $X$ is a narrowband random process. Find and sketch the power spectral
densities $S_U$, $S_V$ and the cross spectral density $S_{UV}$ of jointly WSS baseband random processes $U$
and $V$ so that

\[ X_t = U_t \cos(2\pi f_c t) - V_t \sin(2\pi f_c t), \]

using $f_c = 19.11$ GHz.

(d) The complex envelope process is given by $Z = U + jV$ and the real envelope process is given
by $|Z|$. Specify the distributions of $Z_t$ and $|Z_t|$ for $t$ fixed.

8.24 Declaring the center frequency for a given random process 

Let $a > 0$ and let $g$ be a nonnegative function on $\mathbb{R}$ which is zero outside of the interval $[a, 2a]$.
Suppose $X$ is a narrowband WSS random process with power spectral density function $S_X(\omega) =
g(|\omega|)$, or equivalently, $S_X(\omega) = g(\omega) + g(-\omega)$. The process $X$ can thus be viewed as a narrowband
signal for carrier frequency $\omega_c$, for any choice of $\omega_c$ in the interval $[a, 2a]$. Let $U$ and $V$ be the
baseband random processes in the usual complex envelope representation: $X_t = \mathrm{Re}((U_t + jV_t)e^{j\omega_c t})$.

(a) Express $S_U$ and $S_{UV}$ in terms of $g$ and $\omega_c$.

(b) Describe which choice of $\omega_c$ minimizes $\int_{-\infty}^{\infty} |S_{UV}(\omega)|^2\, \frac{d\omega}{2\pi}$. (Note: If $g$ is symmetric around
some frequency $\nu$, then $\omega_c = \nu$. But what is the answer otherwise?)

8.25 * Cyclostationary random processes 

A random process $X = (X_t : t \in \mathbb{R})$ is said to be cyclostationary with period $T$, if whenever $s$ is
an integer multiple of $T$, $X$ has the same finite dimensional distributions as $(X_{t+s} : t \in \mathbb{R})$. This
property is weaker than stationarity, because stationarity requires equality of finite dimensional
distributions for all real values of $s$.

(a) What properties of the mean function $\mu_X$ and autocorrelation function $R_X$ does any second
order cyclostationary process possess? A process with these properties is called a wide sense
cyclostationary process.

(b) Suppose $X$ is cyclostationary and that $U$ is a random variable independent of $X$ that is uniformly
distributed on the interval $[0, T]$. Let $Y = (Y_t : t \in \mathbb{R})$ be the random process defined by $Y_t = X_{t+U}$.
Argue that $Y$ is stationary, and express the mean and autocorrelation function of $Y$ in terms of
the mean function and autocorrelation function of $X$. Although $X$ is not necessarily WSS, it is
reasonable to define the power spectral density of $X$ to equal the power spectral density of $Y$.

(c) Suppose $B$ is a stationary discrete-time random process and that $g$ is a deterministic function.
Let $X$ be defined by

\[ X_t = \sum_{n=-\infty}^{\infty} g(t - nT)B_n. \]

Show that $X$ is a cyclostationary random process. Find the mean function and autocorrelation
function of $X$ in terms of $g$, $T$, and the mean and autocorrelation function of $B$. If your answer is
complicated, identify special cases which make the answer nice.

(d) Suppose $Y$ is defined as in part (b) for the specific $X$ defined in part (c). Express the mean
$\mu_Y$, autocorrelation function $R_Y$, and power spectral density $S_Y$ in terms of $g$, $T$, $\mu_B$, and $S_B$.

8.26 * Zero crossing rate of a stationary Gaussian process 

Consider a zero-mean stationary Gaussian random process $X$ with $S_X(2\pi f) = |f| - 50$ for $50 \leq
|f| \leq 60$, and $S_X(2\pi f) = 0$ otherwise. Assume the process has continuous sample paths (it can be
shown that such a version exists). A zero crossing from above is said to occur at time $t$ if $X(t) = 0$
and $X(s) > 0$ for all $s$ in an interval of the form $[t - \epsilon, t)$ for some $\epsilon > 0$. Determine the mean rate
of zero crossings from above for $X$. If you can find an analytical solution, great. Alternatively, you
can estimate the rate (aim for three significant digits) by Monte Carlo simulation of the random
process.

Chapter 9

Wiener filtering 



9.1 Return of the orthogonality principle 

Consider the problem of estimating a random process $X$ at some fixed time $t$ given observation of
a random process $Y$ over an interval $[a, b]$. Suppose both $X$ and $Y$ are mean zero second order
random processes and that the mean square error is to be minimized. Let $\hat{X}_t$ denote the
best linear estimator of $X_t$ based on the observations $(Y_s : a \leq s \leq b)$. In other words, define

\[ V_o = \{c_1 Y_{s_1} + \cdots + c_n Y_{s_n} : \text{for some constants } n,\ a \leq s_1, \ldots, s_n \leq b, \text{ and } c_1, \ldots, c_n\} \]

and let $V$ be the m.s. closure of $V_o$, which includes $V_o$ and any random variable that is the m.s.
limit of a sequence of random variables in $V_o$. Then $\hat{X}_t$ is the random variable in $V$ that minimizes
the mean square error, $E[|X_t - \hat{X}_t|^2]$. By the orthogonality principle, the estimator $\hat{X}_t$ exists and it
is unique in the sense that any two solutions to the estimation problem are equal with probability
one.

Perhaps the most useful part of the orthogonality principle is that a random variable $W$ is equal
to $\hat{X}_t$ if and only if (i) $W \in V$ and (ii) $(X_t - W) \perp Z$ for all $Z \in V$. Equivalently, $W$ is equal to
$\hat{X}_t$ if and only if (i) $W \in V$ and (ii) $(X_t - W) \perp Y_u$ for all $u \in [a, b]$. Furthermore, the minimum
mean square error (i.e. the error for the optimal estimator $\hat{X}_t$) is given by $E[|X_t|^2] - E[|\hat{X}_t|^2]$.

Note that m.s. integrals of the form $\int_a^b h(t,s)Y_s\,ds$ are in $V$, because m.s. integrals are m.s.
limits of finite linear combinations of the random variables of $Y$. Typically the set $V$ is larger than
the set of all m.s. integrals of $Y$. For example, if $u$ is a fixed time in $[a, b]$ then $Y_u \in V$. In addition,
if $Y$ is m.s. differentiable, then $Y'_u$ is also in $V$. Typically neither $Y_u$ nor $Y'_u$ can be expressed as a
m.s. integral of $(Y_s : s \in \mathbb{R})$. However, $Y_u$ can be obtained as an integral of the process $Y$ multiplied
by a delta function, though the integration has to be taken in a generalized sense.

The integral $\int_a^b h(t, s)Y_s\,ds$ is the linear MMSE estimator if and only if

\[ X_t - \int_a^b h(t,s)Y_s\,ds \ \perp\ Y_u \quad \text{for } u \in [a,b] \]

or equivalently

\[ E\left[\left(X_t - \int_a^b h(t,s)Y_s\,ds\right)Y_u^*\right] = 0 \quad \text{for } u \in [a,b] \]

or equivalently

\[ R_{XY}(t,u) = \int_a^b h(t,s)R_Y(s,u)\,ds \quad \text{for } u \in [a,b]. \]
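When only finitely many observations $Y_{s_1}, \ldots, Y_{s_n}$ are used, the same orthogonality conditions reduce to a finite system of linear equations for the coefficients of the estimator. The Python sketch below illustrates this for real-valued processes; the correlation functions `R_X` and `R_N` (signal plus uncorrelated noise) and the observation grid are assumptions chosen only for illustration.

```python
import numpy as np

def R_X(tau):
    # Assumed signal autocorrelation, for illustration only.
    return 0.5 * np.exp(-np.abs(tau))

def R_N(tau):
    # Assumed noise autocorrelation, for illustration only.
    return np.exp(-2.0 * np.abs(tau))

def linear_mmse(t, obs_times):
    """Best linear estimator of X_t from Y_s = X_s + N_s at the times in obs_times.

    With X and N uncorrelated, R_Y = R_X + R_N and R_XY = R_X. The coefficients
    c solve sum_j c_j R_Y(s_j - s_i) = R_XY(t - s_i) for every i.
    """
    s = np.asarray(obs_times, dtype=float)
    lags = s[:, None] - s[None, :]
    R_Y = R_X(lags) + R_N(lags)
    r_XY = R_X(t - s)
    coeffs = np.linalg.solve(R_Y, r_XY)
    mmse = R_X(0.0) - coeffs @ r_XY     # E[X_t^2] - E[Xhat_t^2]
    return coeffs, mmse, R_Y, r_XY

obs = np.linspace(-3.0, 3.0, 61)
coeffs, mmse, R_Y, r_XY = linear_mmse(0.0, obs)
print(mmse)
```

Adding observation times can only decrease the mean square error, since the set of available linear combinations grows.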

Suppose now that the observation interval is the whole real line $\mathbb{R}$ and suppose that $X$ and $Y$
are jointly WSS. Then for $t$ and $v$ fixed, the problem of estimating $X_t$ from $(Y_s : s \in \mathbb{R})$ is the
same as the problem of estimating $X_{t+v}$ from $(Y_{s+v} : s \in \mathbb{R})$. Therefore, if $h(t, s)$ for $t$ fixed is the
optimal function to use for estimating $X_t$ from $(Y_s : s \in \mathbb{R})$, then it is also the optimal function to
use for estimating $X_{t+v}$ from $(Y_{s+v} : s \in \mathbb{R})$. Therefore, $h(t, s) = h(t + v, s + v)$, so that $h(t, s)$ is a
function of $t - s$ alone, meaning that the optimal impulse response function $h$ corresponds to a
time-invariant system. Thus, we seek to find an optimal estimator of the form $\hat{X}_t = \int_{-\infty}^{\infty} h(t - s)Y_s\,ds$.
The optimality condition becomes

\[ X_t - \int_{-\infty}^{\infty} h(t - s)Y_s\,ds \ \perp\ Y_u \quad \text{for } u \in \mathbb{R} \]

which is equivalent to the condition

\[ R_{XY}(t - u) = \int_{-\infty}^{\infty} h(t - s)R_Y(s - u)\,ds \quad \text{for } u \in \mathbb{R} \]

or $R_{XY} = h * R_Y$. In the frequency domain the optimality condition becomes $S_{XY}(\omega) = H(\omega)S_Y(\omega)$
for all $\omega$. Consequently, the optimal filter $H$ is given by

\[ H(\omega) = \frac{S_{XY}(\omega)}{S_Y(\omega)} \]

and the corresponding minimum mean square error is given by

\[ E[|X_t - \hat{X}_t|^2] = E[|X_t|^2] - E[|\hat{X}_t|^2] = \int_{-\infty}^{\infty} \left( S_X(\omega) - \frac{|S_{XY}(\omega)|^2}{S_Y(\omega)} \right) \frac{d\omega}{2\pi}. \]
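The frequency-domain expression for $E[|X_t|^2] - E[|\hat{X}_t|^2]$ can be checked with a short calculation, sketched here assuming $S_Y(\omega) > 0$ wherever $S_{XY}(\omega) \neq 0$. Since $\hat{X}$ is the output of the filter $H = S_{XY}/S_Y$ with input $Y$, its power spectral density is $|H|^2 S_Y$, so

```latex
\begin{align*}
E[|\hat{X}_t|^2]
  &= \int_{-\infty}^{\infty} |H(\omega)|^2 S_Y(\omega)\,\frac{d\omega}{2\pi}
   = \int_{-\infty}^{\infty} \frac{|S_{XY}(\omega)|^2}{S_Y(\omega)^2}\, S_Y(\omega)\,\frac{d\omega}{2\pi}
   = \int_{-\infty}^{\infty} \frac{|S_{XY}(\omega)|^2}{S_Y(\omega)}\,\frac{d\omega}{2\pi}, \\
E[|X_t|^2]
  &= \int_{-\infty}^{\infty} S_X(\omega)\,\frac{d\omega}{2\pi}.
\end{align*}
```

Subtracting the first line from the second gives the integral of $S_X - |S_{XY}|^2/S_Y$.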

Example 9.1.1 Consider estimating a random process from observation of the random process
plus noise, as shown in Figure 9.1. Assume that $X$ and $N$ are jointly WSS with mean zero. Suppose

[Figure 9.1: An estimator of a signal from signal plus noise, as the output of a linear filter.]

$X$ and $N$ have known autocorrelation functions and suppose that $R_{XN} \equiv 0$, so the variables of the
process $X$ are uncorrelated with the variables of the process $N$. The observation process is given
by $Y = X + N$. Then $S_{XY} = S_X$ and $S_Y = S_X + S_N$, so the optimal filter is given by

$$H(\omega) = \frac{S_{XY}(\omega)}{S_Y(\omega)} = \frac{S_X(\omega)}{S_X(\omega) + S_N(\omega)}$$

The associated minimum mean square error is given by

$$E[|X_t - \hat{X}_t|^2] = \int_{-\infty}^{\infty} \left( S_X(\omega) - \frac{S_X(\omega)^2}{S_X(\omega) + S_N(\omega)} \right) \frac{d\omega}{2\pi} = \int_{-\infty}^{\infty} \frac{S_X(\omega) S_N(\omega)}{S_X(\omega) + S_N(\omega)} \frac{d\omega}{2\pi}$$



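Example 9.1.2 This example is a continuation of the previous example, for a particular choice
of power spectral densities. Suppose that the signal process $X$ is WSS with mean zero and power
spectral density $S_X(\omega) = \frac{1}{1+\omega^2}$, suppose the noise process $N$ is WSS with mean zero and power
spectral density $S_N(\omega) = \frac{4}{4+\omega^2}$, and suppose $S_{XN} \equiv 0$. Equivalently, $R_X(\tau) = \frac{e^{-|\tau|}}{2}$,
$R_N(\tau) = e^{-2|\tau|}$ and $R_{XN} \equiv 0$. We seek the optimal linear estimator of $X_t$ given $(Y_s : s \in \mathbb{R})$,
where $Y = X + N$. Seeking an estimator of the form

$$\hat{X}_t = \int_{-\infty}^{\infty} h(t-s) Y_s \, ds$$

we find from the previous example that the transform $H$ of $h$ is given by

$$H(\omega) = \frac{S_X(\omega)}{S_X(\omega) + S_N(\omega)} = \frac{\frac{1}{1+\omega^2}}{\frac{1}{1+\omega^2} + \frac{4}{4+\omega^2}} = \frac{4+\omega^2}{8+5\omega^2}$$

We will find $h$ by finding the inverse transform of $H$. First, note that

$$\frac{4+\omega^2}{8+5\omega^2} = \frac{\frac{8}{5}+\omega^2}{8+5\omega^2} + \frac{\frac{12}{5}}{8+5\omega^2} = \frac{1}{5} + \frac{\frac{12}{5}}{8+5\omega^2}$$

We know that $\frac{1}{5}\delta(t) \leftrightarrow \frac{1}{5}$. Also, for any $\alpha > 0$,

$$e^{-\alpha|t|} \leftrightarrow \frac{2\alpha}{\omega^2 + \alpha^2}, \tag{9.1}$$

so

$$\frac{1}{8+5\omega^2} = \frac{\frac{1}{5}}{\frac{8}{5}+\omega^2} = \left(\frac{1}{5 \cdot 2\sqrt{8/5}}\right) \frac{2\sqrt{8/5}}{\frac{8}{5}+\omega^2} \leftrightarrow \left(\frac{1}{4\sqrt{10}}\right) e^{-\sqrt{8/5}\,|t|}$$

Therefore the optimal filter is given in the time domain by

$$h(t) = \frac{1}{5}\delta(t) + \left(\frac{3}{5\sqrt{10}}\right) e^{-\sqrt{8/5}\,|t|}$$

The associated minimum mean square error is given by (one way to do the integration is to use the
fact that if $k \leftrightarrow K$ then $\int_{-\infty}^{\infty} K(\omega) \frac{d\omega}{2\pi} = k(0)$):

$$E[|X_t - \hat{X}_t|^2] = \int_{-\infty}^{\infty} \frac{S_X(\omega) S_N(\omega)}{S_X(\omega) + S_N(\omega)} \frac{d\omega}{2\pi} = \int_{-\infty}^{\infty} \frac{4}{8+5\omega^2} \frac{d\omega}{2\pi} = 4\left(\frac{1}{4\sqrt{10}}\right) = \frac{1}{\sqrt{10}}$$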
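As a numerical sanity check on this example, the following Python sketch approximates the MMSE integral $\int S_X S_N/(S_X+S_N)\, d\omega/2\pi$ and compares it with the closed-form value $1/\sqrt{10}$. The function name `noncausal_mmse`, the substitution $\omega = \tan\theta$, and the grid size are illustrative choices made here, not part of the notes.

```python
import math

def noncausal_mmse(s_x, s_n, n=20000):
    """Approximate the integral of S_X*S_N/(S_X+S_N) dω/(2π) over the real line.

    The substitution ω = tan(θ) maps the real line onto (-π/2, π/2); the
    midpoint rule is then applied on a uniform grid in θ.
    """
    d_theta = math.pi / n
    total = 0.0
    for i in range(n):
        theta = -math.pi / 2 + (i + 0.5) * d_theta
        w = math.tan(theta)
        sx, sn = s_x(w), s_n(w)
        # dω = dθ / cos(θ)^2 under the substitution ω = tan(θ)
        total += sx * sn / (sx + sn) / math.cos(theta) ** 2
    return total * d_theta / (2 * math.pi)

# Spectral densities of Example 9.1.2
mmse = noncausal_mmse(lambda w: 1 / (1 + w * w), lambda w: 4 / (4 + w * w))
print(mmse, 1 / math.sqrt(10))
```

The same function can be reused for other pairs of spectral densities, provided the integrand decays fast enough for the transformed integrand to stay bounded.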



In an example later in this chapter we will return to the same random processes, but seek the best
linear estimator of $X_t$ given $(Y_s : s \le t)$.

9.2 The causal Wiener filtering problem

A linear system is causal if the value of the output at any given time does not depend on the
future of the input. That is to say that the impulse response function satisfies $h(t,s) = 0$ for $s > t$.
In the case of a linear, time-invariant system, causality means that the impulse response function
satisfies $h(\tau) = 0$ for $\tau < 0$. Suppose $X$ and $Y$ are mean zero and jointly WSS. In this section we
will consider estimates of $X$ given $Y$ obtained by passing $Y$ through a causal linear time-invariant
system. For convenience in applications, a fixed parameter $T$ is introduced. Let $\hat{X}_{t+T|t}$ be the
minimum mean square error linear estimate of $X_{t+T}$ given $(Y_s : s \le t)$. Note that if $Y$ is the same
process as $X$ and $T > 0$, then we are addressing the problem of predicting $X_{t+T}$ from $(X_s : s \le t)$.
An estimator of the form $\int_{-\infty}^{\infty} h(t-s) Y_s \, ds$ is sought such that $h$ corresponds to a causal system.
Once again, the orthogonality principle implies that the estimator is optimal if and only if it satisfies

$$X_{t+T} - \int_{-\infty}^{\infty} h(t-s) Y_s \, ds \;\perp\; Y_u \quad \text{for } u \le t$$

which is equivalent to the condition

$$R_{XY}(t+T-u) = \int_{-\infty}^{\infty} h(t-s) R_Y(s-u) \, ds \quad \text{for } u \le t$$

or $R_{XY}(t+T-u) = h * R_Y(t-u)$. Setting $\tau = t - u$ and combining this optimality condition
with the constraint that $h$ is a causal function, the problem is to find an impulse response function
$h$ satisfying:

$$R_{XY}(\tau + T) = h * R_Y(\tau) \quad \text{for } \tau \ge 0 \tag{9.2}$$

$$h(v) = 0 \quad \text{for } v < 0 \tag{9.3}$$

Equations (9.2) and (9.3) are called the Wiener-Hopf equations. We shall show how to solve them
in the case the power spectral densities are rational functions by using the method of spectral
factorization. The next section describes some of the tools needed for the solution.

9.3 Causal functions and spectral factorization 

A function $h$ on $\mathbb{R}$ is said to be causal if $h(\tau) = 0$ for $\tau < 0$, and it is said to be anticausal if
$h(\tau) = 0$ for $\tau > 0$. Any function $h$ on $\mathbb{R}$ can be expressed as the sum of a causal function and
an anticausal function as follows. Simply let $u(t) = I_{\{t \ge 0\}}$ and notice that $h(t)$ is the sum of the
causal function $u(t)h(t)$ and the anticausal function $(1 - u(t))h(t)$. More compactly, we have the
representation $h = uh + (1-u)h$.

A transfer function $H$ is said to be of positive type if the corresponding impulse response function
$h$ is causal, and $H$ is said to be of negative type if the corresponding impulse response function is
anticausal. Any transfer function can be written as the sum of a positive type transfer function
and a negative type transfer function. Indeed, suppose $H$ is the Fourier transform of an impulse
response function $h$. Define $[H]_+$ to be the Fourier transform of $uh$ and $[H]_-$ to be the Fourier
transform of $(1-u)h$. Then $[H]_+$ is called the positive part of $H$ and $[H]_-$ is called the negative
part of $H$. The following properties hold:

• $H = [H]_+ + [H]_-$ (because $h = uh + (1-u)h$)

• $[H]_+ = H$ if and only if $H$ is positive type

• $[H]_- = 0$ if and only if $H$ is positive type

• $[[H]_+]_- = 0$ for any $H$

• $[[H]_+]_+ = [H]_+$ and $[[H]_-]_- = [H]_-$

• $[H + G]_+ = [H]_+ + [G]_+$ and $[H + G]_- = [H]_- + [G]_-$

Note that $uh$ is the causal function that is closest to $h$ in the $L^2$ norm. That is, $uh$ is the
projection of $h$ onto the space of causal functions. Indeed, if $k$ is any causal function, then

$$\int_{-\infty}^{\infty} |h(t) - k(t)|^2 \, dt = \int_{-\infty}^{0} |h(t)|^2 \, dt + \int_{0}^{\infty} |h(t) - k(t)|^2 \, dt \;\ge\; \int_{-\infty}^{0} |h(t)|^2 \, dt \tag{9.4}$$

and equality holds in (9.4) if and only if $k = uh$ (except possibly on a set of measure zero). By
Parseval's relation, it follows that $[H]_+$ is the positive type function that is closest to $H$ in the $L^2$
norm. Equivalently, $[H]_+$ is the projection of $H$ onto the space of positive type functions. Similarly,
$[H]_-$ is the projection of $H$ onto the space of negative type functions. Up to this point in these
notes, Fourier transforms have been defined for real values of $\omega$ only. However, for the purposes
of factorization to be covered later, it is useful to consider the analytic continuation of the Fourier
transforms to larger sets in $\mathbb{C}$. We use the same notation $H(\omega)$ for the function $H$ defined for real
values of $\omega$ only, and its continuation defined for complex $\omega$. The following examples illustrate the
use of the projections $[\ ]_+$ and $[\ ]_-$, and consideration of transforms for complex $\omega$.



Example 9.3.1 Let $g(t) = e^{-\alpha|t|}$ for a constant $\alpha > 0$. The functions $g$, $ug$ and $(1-u)g$ are
pictured in Figure 9.2.

[Figure 9.2: Decomposition of a two-sided exponential function. Panels: $g(t)$, $u(t)g(t)$, $(1-u(t))g(t)$.]

The corresponding transforms are given by:

$$[G]_+(\omega) = \int_0^{\infty} e^{-\alpha t} e^{-j\omega t} \, dt = \frac{1}{j\omega + \alpha}$$

$$[G]_-(\omega) = \int_{-\infty}^{0} e^{\alpha t} e^{-j\omega t} \, dt = \frac{1}{-j\omega + \alpha}$$

$$G(\omega) = [G]_+(\omega) + [G]_-(\omega) = \frac{2\alpha}{\omega^2 + \alpha^2}$$

Note that $[G]_+$ has a pole at $\omega = j\alpha$, so that the imaginary part of the pole of $[G]_+$ is positive.
Equivalently, the pole of $[G]_+$ is in the upper half plane.

More generally, suppose that $G(\omega)$ has the representation

$$G(\omega) = \sum_{n=1}^{N_1} \frac{\gamma_n}{j\omega + \alpha_n} + \sum_{n=N_1+1}^{N} \frac{\gamma_n}{-j\omega + \alpha_n}$$

where $\mathrm{Re}(\alpha_n) > 0$ for all $n$. Then

$$[G]_+(\omega) = \sum_{n=1}^{N_1} \frac{\gamma_n}{j\omega + \alpha_n} \qquad [G]_-(\omega) = \sum_{n=N_1+1}^{N} \frac{\gamma_n}{-j\omega + \alpha_n}$$

Example 9.3.2 Let $G$ be given by

$$G(\omega) = \frac{1 - \omega^2}{(j\omega + 1)(j\omega + 3)(j\omega - 2)}$$

Note that $G$ has only three simple poles. The numerator of $G$ has no factors in common with the
denominator, and the degree of the numerator is smaller than the degree of the denominator. By
the theory of partial fraction expansions in complex analysis, it therefore follows that $G$ can be
written as

$$G(\omega) = \frac{\gamma_1}{j\omega + 1} + \frac{\gamma_2}{j\omega + 3} + \frac{\gamma_3}{j\omega - 2}$$

In order to identify $\gamma_1$, for example, multiply both expressions for $G$ by $(j\omega + 1)$ and then let
$j\omega = -1$. The other constants are found similarly, using $1 - \omega^2 = 1 + (j\omega)^2$. Thus

$$\gamma_1 = \left.\frac{1 - \omega^2}{(j\omega + 3)(j\omega - 2)}\right|_{j\omega = -1} = \frac{1 + (-1)^2}{(-1+3)(-1-2)} = -\frac{1}{3}$$

$$\gamma_2 = \left.\frac{1 - \omega^2}{(j\omega + 1)(j\omega - 2)}\right|_{j\omega = -3} = \frac{1 + 3^2}{(-3+1)(-3-2)} = 1$$

$$\gamma_3 = \left.\frac{1 - \omega^2}{(j\omega + 1)(j\omega + 3)}\right|_{j\omega = 2} = \frac{1 + 2^2}{(2+1)(2+3)} = \frac{1}{3}$$

Consequently,

$$[G]_+(\omega) = -\frac{1}{3(j\omega + 1)} + \frac{1}{j\omega + 3} \quad \text{and} \quad [G]_-(\omega) = \frac{1}{3(j\omega - 2)}$$
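The cover-up computation in this example can be sketched in a few lines of Python. The helper `residues` is a hypothetical name introduced here; it assumes simple poles and a denominator that is monic in the variable $s = j\omega$, which holds for this $G$.

```python
def residues(numerator, poles):
    """Cover-up rule for simple poles of numerator(s) / prod_q (s - q).

    Returns the coefficient of 1/(s - p) for each pole p.
    """
    out = []
    for i, p in enumerate(poles):
        denom = 1.0
        for k, q in enumerate(poles):
            if k != i:
                denom *= p - q
        out.append(numerator(p) / denom)
    return out

# G in the variable s = jω: (1 + s^2) / ((s + 1)(s + 3)(s - 2))
poles = [-1.0, -3.0, 2.0]
gammas = residues(lambda s: 1 + s * s, poles)

# Poles with Re(s) < 0 contribute to [G]_+, poles with Re(s) > 0 to [G]_-
positive_part = [(g, p) for g, p in zip(gammas, poles) if p < 0]
negative_part = [(g, p) for g, p in zip(gammas, poles) if p > 0]
print(gammas)
```

Splitting by the sign of the real part of each pole in $s$ is the same as splitting by upper and lower half plane in $\omega$, since $s = j\omega$.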



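Example 9.3.3 Suppose that $G(\omega) = \frac{e^{-j\omega T}}{j\omega + \alpha}$. Multiplication by $e^{-j\omega T}$ in the frequency domain
represents a shift by $T$ in the time domain, so that

$$g(t) = \begin{cases} e^{-\alpha(t-T)} & t \ge T \\ 0 & t < T \end{cases}$$

as pictured in Figure 9.3. Consider two cases. First, if $T \ge 0$, then $g$ is causal, $G$ is positive type,
and therefore $[G]_+ = G$ and $[G]_- = 0$.

[Figure 9.3: Exponential function shifted by $T$. Panels: $T > 0$ and $T < 0$.]

Second, if $T \le 0$ then

$$g(t)u(t) = \begin{cases} e^{\alpha T} e^{-\alpha t} & t \ge 0 \\ 0 & t < 0 \end{cases}$$

so that $[G]_+(\omega) = \frac{e^{\alpha T}}{j\omega + \alpha}$ and $[G]_-(\omega) = G(\omega) - [G]_+(\omega) = \frac{e^{-j\omega T} - e^{\alpha T}}{j\omega + \alpha}$. We can also find $[G]_-$ by
computing the transform of $(1 - u(t))g(t)$ (still assuming that $T \le 0$):

$$[G]_-(\omega) = \int_T^0 e^{\alpha T} e^{-\alpha t} e^{-j\omega t} \, dt = \left.\frac{e^{\alpha T - (\alpha + j\omega)t}}{-(\alpha + j\omega)}\right|_{t=T}^{0} = \frac{e^{-j\omega T} - e^{\alpha T}}{j\omega + \alpha}$$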

Example 9.3.4 Suppose $H$ is the transfer function for impulse response function $h$. Let us unravel
the notation and express

$$\int_{-\infty}^{\infty} \left| \left[ e^{j\omega T} H(\omega) \right]_+ \right|^2 \frac{d\omega}{2\pi}$$

in terms of $h$ and $T$. (Note that the factor $e^{j\omega T}$ is used, rather than $e^{-j\omega T}$ as in the previous
example.) Multiplication by $e^{j\omega T}$ in the frequency domain corresponds to shifting by $-T$ in the
time domain, so that

$$e^{j\omega T} H(\omega) \leftrightarrow h(t + T)$$

and thus

$$\left[ e^{j\omega T} H(\omega) \right]_+ \leftrightarrow u(t) h(t + T)$$

Applying Parseval's identity, the definition of $u$, and a change of variables yields

$$\int_{-\infty}^{\infty} \left| \left[ e^{j\omega T} H(\omega) \right]_+ \right|^2 \frac{d\omega}{2\pi} = \int_{-\infty}^{\infty} |u(t) h(t+T)|^2 \, dt = \int_0^{\infty} |h(t+T)|^2 \, dt = \int_T^{\infty} |h(t)|^2 \, dt$$

The integral decreases from the energy of $h$ to zero as $T$ ranges from $-\infty$ to $\infty$.



Example 9.3.5 Suppose $[H]_- = [K]_- = 0$. Let us find $[HK]_-$. As usual, let $h$ denote the inverse
transform of $H$, and $k$ denote the inverse transform of $K$. The supposition implies that $h$ and $k$
are both causal functions. Therefore the convolution $h * k$ is also a causal function. Since $HK$ is
the transform of $h * k$, it follows that $HK$ is a positive type function. Equivalently, $[HK]_- = 0$.

The decomposition $H = [H]_+ + [H]_-$ is an additive one. Next we turn to multiplicative
decomposition, concentrating on rational functions. A function $H$ is said to be rational if it can
be written as the ratio of two polynomials. Since polynomials can be factored over the complex
numbers, a rational function $H$ can be expressed in the form

$$H(\omega) = \gamma \, \frac{(j\omega + \beta_1)(j\omega + \beta_2) \cdots (j\omega + \beta_K)}{(j\omega + \alpha_1)(j\omega + \alpha_2) \cdots (j\omega + \alpha_N)}$$

for complex constants $\gamma, \alpha_1, \ldots, \alpha_N, \beta_1, \ldots, \beta_K$. Without loss of generality, we assume that
$\{\alpha_i\} \cap \{\beta_j\} = \emptyset$. We also assume that the real parts of the constants $\alpha_1, \ldots, \alpha_N, \beta_1, \ldots, \beta_K$ are nonzero.
The function $H$ is positive type if and only if $\mathrm{Re}(\alpha_i) > 0$ for all $i$, or equivalently, if and only if all
the poles of $H(\omega)$ are in the upper half plane $\mathrm{Im}(\omega) > 0$.

A positive type function $H$ is said to have minimum phase if $\mathrm{Re}(\beta_i) > 0$ for all $i$. Thus, a
positive type function $H$ is minimum phase if and only if $1/H$ is also positive type.

Suppose that $S_Y$ is the power spectral density of a WSS random process and that $S_Y$ is a
rational function. The function $S_Y$, being nonnegative, is also real-valued, so $S_Y = S_Y^*$. Thus, if
the denominator of $S_Y$ has a factor of the form $j\omega + \alpha$ then the denominator must also have a
factor of the form $-j\omega + \alpha^*$. Similarly, if the numerator of $S_Y$ has a factor of the form $j\omega + \beta$ then
the numerator must also have a factor of the form $-j\omega + \beta^*$.

Example 9.3.6 The function $S_Y$ given by

$$S_Y(\omega) = \frac{8 + 5\omega^2}{(1 + \omega^2)(4 + \omega^2)}$$

can be factored as

$$S_Y(\omega) = \underbrace{\frac{\sqrt{5}\left(j\omega + \sqrt{8/5}\right)}{(j\omega + 2)(j\omega + 1)}}_{S_Y^+(\omega)} \; \underbrace{\frac{\sqrt{5}\left(-j\omega + \sqrt{8/5}\right)}{(-j\omega + 2)(-j\omega + 1)}}_{S_Y^-(\omega)} \tag{9.5}$$

where $S_Y^+$ is a positive type, minimum phase function and $S_Y^-$ is a negative type function with
$S_Y^- = (S_Y^+)^*$.
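A quick way to check a candidate spectral factor is to confirm numerically that $|S_Y^+(\omega)|^2 = S_Y(\omega)$ on the real axis. The sketch below does this for the density of Example 9.3.6, taking $S_Y^+(\omega) = \sqrt{5}\,(j\omega + \sqrt{8/5})/((j\omega+2)(j\omega+1))$; the function names and the test frequencies are arbitrary choices made here.

```python
import math

def s_y(w):
    """Power spectral density of Example 9.3.6."""
    return (8 + 5 * w * w) / ((1 + w * w) * (4 + w * w))

def s_y_plus(w):
    """Candidate positive type, minimum phase factor, written in s = jω."""
    s = 1j * w
    return math.sqrt(5) * (s + math.sqrt(8 / 5)) / ((s + 2) * (s + 1))

# Largest discrepancy between |S_Y^+|^2 and S_Y over a few test frequencies
max_gap = max(abs(abs(s_y_plus(w)) ** 2 - s_y(w)) for w in [0.0, 0.3, 1.0, 2.5, 10.0])
print(max_gap)
```

The check covers only the magnitude condition. That the poles $s = -1, -2$ and the zero $s = -\sqrt{8/5}$ all have negative real part (positive type and minimum phase) is read off from the factored form.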

Note that the operators $[\ ]_+$ and $[\ ]_-$ give us an additive decomposition of a function $H$ into
the sum of a positive type and a negative type function, whereas spectral factorization has to do
with products. At least formally, the factorization can be accomplished by taking a logarithm,
doing an additive decomposition, and then exponentiating:

$$S_X(\omega) = \underbrace{\exp\left([\ln S_X(\omega)]_+\right)}_{S_X^+(\omega)} \; \underbrace{\exp\left([\ln S_X(\omega)]_-\right)}_{S_X^-(\omega)}. \tag{9.6}$$

Notice that if $h \leftrightarrow H$ then, formally,

$$\delta + h + \frac{h * h}{2!} + \frac{h * h * h}{3!} + \cdots \;\leftrightarrow\; \exp(H) = 1 + H + \frac{H^2}{2!} + \frac{H^3}{3!} + \cdots$$

so that if $H$ is positive type, then $\exp(H)$ is also positive type. Thus, the factor $S_X^+$ in (9.6) is
indeed a positive type function, and the factor $S_X^-$ is a negative type function. Use of (9.6) is called
the cepstrum method. Unfortunately, there is a host of problems, both numerical and analytical,
in using the method, so that it will not be used further in these notes.

9.4 Solution of the causal Wiener filtering problem for rational 
power spectral densities 

The Wiener-Hopf equations (9.2) and (9.3) can be formulated in the frequency domain as follows:
Find a positive type transfer function $H$ such that

$$\left[ e^{j\omega T} S_{XY} - H S_Y \right]_+ = 0 \tag{9.7}$$

Suppose $S_Y$ is factored as $S_Y = S_Y^+ S_Y^-$ such that $S_Y^+$ is a minimum phase, positive type transfer
function and $S_Y^- = (S_Y^+)^*$. Then $S_Y^-$ and $\frac{1}{S_Y^-}$ are negative type functions. Since the product of
two negative type functions is again negative type, (9.7) is equivalent to the equation obtained by
multiplying the quantity within square brackets in (9.7) by $\frac{1}{S_Y^-}$, yielding the equivalent problem:

Find a positive type transfer function $H$ such that

$$\left[ \frac{e^{j\omega T} S_{XY}}{S_Y^-} - H S_Y^+ \right]_+ = 0 \tag{9.8}$$

The function $H S_Y^+$, being the product of two positive type functions, is itself positive type. Thus
(9.8) becomes

$$\left[ \frac{e^{j\omega T} S_{XY}}{S_Y^-} \right]_+ - H S_Y^+ = 0$$

Solving for $H$ yields that the optimal transfer function is given by

$$H = \frac{1}{S_Y^+} \left[ \frac{e^{j\omega T} S_{XY}}{S_Y^-} \right]_+ \tag{9.9}$$

The orthogonality principle yields that the mean square error satisfies

$$E[|X_{t+T} - \hat{X}_{t+T|t}|^2] = E[|X_{t+T}|^2] - E[|\hat{X}_{t+T|t}|^2]$$

$$= R_X(0) - \int_{-\infty}^{\infty} |H(\omega)|^2 S_Y(\omega) \frac{d\omega}{2\pi}$$

$$= R_X(0) - \int_{-\infty}^{\infty} \left| \left[ \frac{e^{j\omega T} S_{XY}}{S_Y^-} \right]_+ \right|^2 \frac{d\omega}{2\pi} \tag{9.10}$$

where we used the fact that $|S_Y^+|^2 = S_Y$.

Another expression for the MMSE, which involves the optimal filter $h$, is the following:

$$\text{MMSE} = E[(X_{t+T} - \hat{X}_{t+T|t})(X_{t+T} - \hat{X}_{t+T|t})^*]$$

$$= E[(X_{t+T} - \hat{X}_{t+T|t}) X_{t+T}^*] = R_X(0) - R_{\hat{X}X}(t, t+T)$$

$$= R_X(0) - \int_{-\infty}^{\infty} h(s) R_{XY}^*(s + T) \, ds.$$

Exercise Evaluate the limit as $T \to -\infty$ and the limit as $T \to \infty$ in (9.10).



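Example 9.4.1 This example involves the same model as in an example in Section 9.1, but here
a causal estimator is sought. The observed random process is $Y = X + N$, where $X$ is WSS with
mean zero and power spectral density $S_X(\omega) = \frac{1}{1+\omega^2}$, $N$ is WSS with mean zero and power spectral
density $S_N(\omega) = \frac{4}{4+\omega^2}$, and $S_{XN} = 0$. We seek the optimal causal linear estimator of $X_t$ given
$(Y_s : s \le t)$. The power spectral density of $Y$ is given by

$$S_Y(\omega) = S_X(\omega) + S_N(\omega) = \frac{8 + 5\omega^2}{(1+\omega^2)(4+\omega^2)}$$

and its spectral factorization is given by (9.5), yielding $S_Y^+$ and $S_Y^-$. Since $R_{XN} = 0$ it follows that

$$S_{XY}(\omega) = S_X(\omega) = \frac{1}{(j\omega + 1)(-j\omega + 1)}$$

Therefore

$$\frac{S_{XY}(\omega)}{S_Y^-(\omega)} = \frac{(-j\omega + 2)}{\sqrt{5}\,(j\omega + 1)\left(-j\omega + \sqrt{8/5}\right)} = \frac{\gamma_1}{j\omega + 1} + \frac{\gamma_2}{-j\omega + \sqrt{8/5}}$$

where

$$\gamma_1 = \left.\frac{-j\omega + 2}{\sqrt{5}\left(-j\omega + \sqrt{8/5}\right)}\right|_{j\omega = -1} = \frac{3}{\sqrt{5} + \sqrt{8}}$$

$$\gamma_2 = \left.\frac{-j\omega + 2}{\sqrt{5}\,(j\omega + 1)}\right|_{j\omega = \sqrt{8/5}} = \frac{-\sqrt{8/5} + 2}{\sqrt{5} + \sqrt{8}}$$

Therefore

$$\left[ \frac{S_{XY}(\omega)}{S_Y^-(\omega)} \right]_+ = \frac{\gamma_1}{j\omega + 1}$$

and thus

$$H(\omega) = \frac{\gamma_1 (j\omega + 2)}{\sqrt{5}\left(j\omega + \sqrt{8/5}\right)} = \frac{3}{5 + 2\sqrt{10}} \left( 1 + \frac{2 - \sqrt{8/5}}{j\omega + \sqrt{8/5}} \right)$$

so that the optimal causal filter is

$$h(t) = \frac{3}{5 + 2\sqrt{10}} \left( \delta(t) + \left(2 - \sqrt{8/5}\right) u(t)\, e^{-t\sqrt{8/5}} \right) \tag{9.11}$$

Finally, by (9.10) with $T = 0$, (9.11), and (9.1), the minimum mean square error is given by

$$E[|X_t - \hat{X}_t|^2] = R_X(0) - \int_{-\infty}^{\infty} \frac{\gamma_1^2}{1 + \omega^2} \frac{d\omega}{2\pi} = \frac{1}{2} - \frac{\gamma_1^2}{2} \approx 0.3246$$

which is slightly larger than $\frac{1}{\sqrt{10}} \approx 0.3162$, the MMSE found for the best noncausal estimator (see
the example in Section 9.1), and slightly smaller than $\frac{1}{3}$, the MMSE for the best "instantaneous"
estimator of $X_t$ given $Y_t$, which is $\frac{Y_t}{3}$.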
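The three error values compared in this example can be reproduced with a few lines of arithmetic. The sketch below evaluates the causal MMSE from $\gamma_1 = 3/(\sqrt{5}+\sqrt{8})$ and cross-checks it against the alternative expression $R_X(0) - \int h(s) R_{XY}(s)\,ds$, using $h(t) = c\,(\delta(t) + (2-a)u(t)e^{-at})$ with $c = 3/(5+2\sqrt{10})$, $a = \sqrt{8/5}$, and $R_{XY}(s) = e^{-|s|}/2$, with the integral done in closed form. Variable names are illustrative only.

```python
import math

a = math.sqrt(8 / 5)                        # decay rate in the optimal causal filter
gamma1 = 3 / (math.sqrt(5) + math.sqrt(8))  # coefficient of 1/(jω + 1)

# MMSE from (9.10) with T = 0: R_X(0) - gamma1^2 / 2, where R_X(0) = 1/2
mmse_causal = 0.5 - gamma1 ** 2 / 2

# Cross-check: R_X(0) - ∫ h(s) R_XY(s) ds, where
#   ∫ δ(s) e^{-|s|}/2 ds = 1/2  and  ∫_0^∞ e^{-a s} e^{-s}/2 ds = 1 / (2 (1 + a))
c = 3 / (5 + 2 * math.sqrt(10))
mmse_check = 0.5 - c * (0.5 + (2 - a) / (2 * (1 + a)))

mmse_noncausal = 1 / math.sqrt(10)          # from the example in Section 9.1
mmse_instantaneous = 1 / 3                  # estimator Y_t / 3

print(mmse_noncausal, mmse_causal, mmse_instantaneous)
```

The ordering noncausal < causal < instantaneous reflects the nested sets of observations each estimator is allowed to use.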
Example 9.4.2 A special case of the causal filtering problem formulated above is when the
observed process $Y$ is equal to $X$ itself. This leads to the pure prediction problem. Let $X$ be a
WSS mean zero random process and let $T > 0$. Then the optimal linear predictor of $X_{t+T}$ given
$(X_s : s \le t)$ corresponds to a linear time-invariant system with transfer function $H$ given by
(because $S_{XY} = S_X$, $S_Y = S_X$, $S_Y^+ = S_X^+$, and $S_Y^- = S_X^-$):

$$H = \frac{1}{S_X^+} \left[ S_X^+ e^{j\omega T} \right]_+ \tag{9.12}$$

To be more specific, suppose that $S_X(\omega) = \frac{1}{\omega^4 + 4}$. Observe that $\omega^4 + 4 = (\omega^2 + 2j)(\omega^2 - 2j)$. Since
$2j = (1+j)^2$, we have $(\omega^2 - 2j) = (\omega + 1 + j)(\omega - 1 - j)$. Factoring the term $(\omega^2 + 2j)$ in a similar
way (using $-2j = (1-j)^2$), and rearranging terms as needed, yields that the factorization of $S_X$ is given by

$$S_X(\omega) = \underbrace{\frac{1}{(j\omega + (1+j))(j\omega + (1-j))}}_{S_X^+(\omega)} \; \underbrace{\frac{1}{(-j\omega + (1+j))(-j\omega + (1-j))}}_{S_X^-(\omega)}$$

so that

$$S_X^+(\omega) = \frac{1}{(j\omega + (1+j))(j\omega + (1-j))} = \frac{\gamma_1}{j\omega + (1+j)} + \frac{\gamma_2}{j\omega + (1-j)}$$

where

$$\gamma_1 = \left.\frac{1}{j\omega + (1-j)}\right|_{j\omega = -(1+j)} = \frac{j}{2} \qquad \gamma_2 = \left.\frac{1}{j\omega + (1+j)}\right|_{j\omega = -1+j} = -\frac{j}{2}$$

yielding that the inverse Fourier transform of $S_X^+$ is given by

$$S_X^+ \leftrightarrow \frac{j}{2} e^{-(1+j)t} u(t) - \frac{j}{2} e^{-(1-j)t} u(t)$$

Hence

$$S_X^+(\omega) e^{j\omega T} \leftrightarrow \begin{cases} \frac{j}{2} e^{-(1+j)(t+T)} - \frac{j}{2} e^{-(1-j)(t+T)} & t \ge -T \\ 0 & \text{else} \end{cases}$$

so that

$$\left[ S_X^+(\omega) e^{j\omega T} \right]_+ = \frac{j e^{-(1+j)T}}{2(j\omega + (1+j))} - \frac{j e^{-(1-j)T}}{2(j\omega + (1-j))}$$

The formula (9.12) for the optimal transfer function yields

$$H(\omega) = \frac{j e^{-(1+j)T} (j\omega + (1-j))}{2} - \frac{j e^{-(1-j)T} (j\omega + (1+j))}{2}$$

$$= e^{-T} \left[ \frac{e^{jT}(1+j) - e^{-jT}(1-j)}{2j} + \frac{j\omega \left( e^{jT} - e^{-jT} \right)}{2j} \right]$$

$$= e^{-T} \left[ \cos(T) + \sin(T) + j\omega \sin(T) \right]$$

so that the optimal predictor for this example is given by

$$\hat{X}_{t+T|t} = X_t e^{-T} (\cos(T) + \sin(T)) + X'_t e^{-T} \sin(T)$$
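The algebra leading to the closed form of $H$ in this example is easy to get wrong, so a numerical comparison is a useful safeguard. The following sketch evaluates $H = (1/S_X^+)[S_X^+ e^{j\omega T}]_+$ directly from the partial fraction coefficients $\gamma_1 = j/2$, $\gamma_2 = -j/2$ and compares it with $e^{-T}[\cos T + \sin T + j\omega \sin T]$. The function names and test points are assumptions made for illustration.

```python
import cmath
import math

def predictor_from_factorization(w, T):
    """H(ω) = (1/S_X^+) [S_X^+ e^{jωT}]_+ built from the partial fractions."""
    s = 1j * w
    a1, a2 = 1 + 1j, 1 - 1j          # S_X^+ = 1 / ((s + a1)(s + a2))
    g1, g2 = 0.5j, -0.5j             # partial fraction coefficients
    positive_part = (g1 * cmath.exp(-a1 * T) / (s + a1)
                     + g2 * cmath.exp(-a2 * T) / (s + a2))
    return positive_part * (s + a1) * (s + a2)

def predictor_closed_form(w, T):
    """Closed form e^{-T} [cos T + sin T + jω sin T]."""
    return math.exp(-T) * (math.cos(T) + math.sin(T) + 1j * w * math.sin(T))

# Largest discrepancy over a small grid of frequencies and prediction horizons
gap = max(abs(predictor_from_factorization(w, T) - predictor_closed_form(w, T))
          for w in [-3.0, 0.0, 0.4, 2.0] for T in [0.0, 0.5, 1.0, 3.0])
print(gap)
```

At $T = 0$ both expressions reduce to $H \equiv 1$, which matches the fact that the best estimate of $X_t$ from $(X_s : s \le t)$ is $X_t$ itself.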



9.5 Discrete time Wiener filtering 

Causal Wiener filtering for discrete-time random processes can be handled in much the same way 
that it is handled for continuous time random processes. An alternative approach can be based 
on the use of whitening filters and linear innovations sequences. Both of these approaches will be 
discussed in this section, but first the topic of spectral factorization for discrete-time processes is 
discussed. 

Spectral factorization for discrete time processes naturally involves z-transforms. The z-transform
of a function $(h_k : k \in \mathbb{Z})$ is given by

$$\mathcal{H}(z) = \sum_{k=-\infty}^{\infty} h(k) z^{-k}$$

for $z \in \mathbb{C}$. Setting $z = e^{j\omega}$ yields the Fourier transform: $H(\omega) = \mathcal{H}(e^{j\omega})$ for $0 \le \omega \le 2\pi$. Thus, the
z-transform $\mathcal{H}$ restricted to the unit circle in $\mathbb{C}$ is equivalent to the Fourier transform $H$ on $[0, 2\pi]$,
and $\mathcal{H}(z)$ for other $z \in \mathbb{C}$ is an analytic continuation of its values on the unit circle.

Let $\tilde{h}(k) = h^*(-k)$ as before. Then the z-transform of $\tilde{h}$ is related to the z-transform $\mathcal{H}$ of $h$ as
follows:

$$\sum_{k=-\infty}^{\infty} \tilde{h}(k) z^{-k} = \sum_{k=-\infty}^{\infty} h^*(-k) z^{-k} = \sum_{l=-\infty}^{\infty} h^*(l) z^{l} = \left( \sum_{l=-\infty}^{\infty} h(l) (1/z^*)^{-l} \right)^* = \mathcal{H}^*(1/z^*)$$

The impulse response function $h$ is called causal if $h(k) = 0$ for $k < 0$. The z-transform $\mathcal{H}$ is
said to be positive type if $h$ is causal. Note that if $\mathcal{H}$ is positive type, then $\lim_{|z| \to \infty} \mathcal{H}(z) = h(0)$.
The projection $[\mathcal{H}]_+$ is defined as it was for Fourier transforms: it is the z-transform of the function
$u(k)h(k)$, where $u(k) = I_{\{k \ge 0\}}$. (We will not need to define or use $[\ ]_-$ for discrete time functions.)

If $X$ is a discrete-time WSS random process with correlation function $R_X$, the z-transform of
$R_X$ is denoted by $\mathcal{S}_X$. Similarly, if $X$ and $Y$ are jointly WSS then the z-transform of $R_{XY}$ is
denoted by $\mathcal{S}_{XY}$. Recall that if $Y$ is the output random process when $X$ is passed through a linear
time-invariant system with impulse response function $h$, then $X$ and $Y$ are jointly WSS and

$$R_{YX} = h * R_X \qquad R_{XY} = \tilde{h} * R_X \qquad R_Y = h * \tilde{h} * R_X$$

which in the z-transform domain becomes:

$$\mathcal{S}_{YX}(z) = \mathcal{H}(z) \mathcal{S}_X(z) \qquad \mathcal{S}_{XY}(z) = \mathcal{H}^*(1/z^*) \mathcal{S}_X(z) \qquad \mathcal{S}_Y(z) = \mathcal{H}(z) \mathcal{H}^*(1/z^*) \mathcal{S}_X(z)$$



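Example 9.5.1 Suppose $Y$ is the output process when white noise $W$ with $R_W(k) = I_{\{k=0\}}$ is
passed through a linear time invariant system with impulse response function $h(k) = \rho^k I_{\{k \ge 0\}}$,
where $\rho$ is a complex constant with $|\rho| < 1$. Let us find $\mathcal{H}$, $\mathcal{S}_Y$, and $R_Y$. To begin,

$$\mathcal{H}(z) = \sum_{k=0}^{\infty} (\rho/z)^k = \frac{1}{1 - \rho/z}$$

and the z-transform of $\tilde{h}$ is $\frac{1}{1 - \rho^* z}$. Note that the z-transform for $h$ converges absolutely for $|z| > |\rho|$,
whereas the z-transform for $\tilde{h}$ converges absolutely for $|z| < 1/|\rho|$. Then

$$\mathcal{S}_Y(z) = \mathcal{H}(z) \mathcal{H}^*(1/z^*) \mathcal{S}_W(z) = \frac{1}{(1 - \rho/z)(1 - \rho^* z)}$$

The autocorrelation function $R_Y$ can be found either in the time domain using $R_Y = h * \tilde{h} * R_W$
or by inverting the z-transform $\mathcal{S}_Y$. Taking the latter approach, factor out $z$ and use the method of
partial fraction expansion to obtain

$$\mathcal{S}_Y(z) = \frac{z}{(z - \rho)(1 - \rho^* z)} = z \left( \frac{1}{(1 - |\rho|^2)(z - \rho)} + \frac{1}{((1/\rho^*) - \rho)(1 - \rho^* z)} \right)$$

$$= \frac{1}{(1 - |\rho|^2)} \left( \frac{1}{1 - \rho/z} + \frac{z\rho^*}{1 - \rho^* z} \right)$$

which is the z-transform of

$$R_Y(k) = \begin{cases} \dfrac{\rho^k}{1 - |\rho|^2} & k \ge 0 \\[2mm] \dfrac{(\rho^*)^{-k}}{1 - |\rho|^2} & k < 0 \end{cases}$$

The z-transform $\mathcal{S}_Y$ of $R_Y$ converges absolutely for $|\rho| < |z| < 1/|\rho|$.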
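The time-domain route mentioned in this example can be checked numerically. For white-noise input with $R_W(k) = I_{\{k=0\}}$, the relation $R_Y = h * \tilde{h} * R_W$ reduces to $R_Y(k) = \sum_m h(m+k) h^*(m)$. The sketch below truncates that sum and compares it with the closed form $\rho^k/(1-|\rho|^2)$ for $k \ge 0$ and $(\rho^*)^{-k}/(1-|\rho|^2)$ for $k < 0$. The function names, the truncation length, and the sample value of $\rho$ are arbitrary choices.

```python
def r_y_time_domain(rho, k, terms=400):
    """Truncated sum R_Y(k) = sum_m h(m + k) * conj(h(m)), with h(m) = rho^m for m >= 0."""
    total = 0j
    for m in range(terms):
        if m + k >= 0:                      # h vanishes at negative times
            total += rho ** (m + k) * (rho ** m).conjugate()
    return total

def r_y_closed_form(rho, k):
    """Closed form obtained by inverting the z-transform of R_Y."""
    scale = 1 / (1 - abs(rho) ** 2)
    return scale * (rho ** k if k >= 0 else rho.conjugate() ** (-k))

rho = 0.5 + 0.3j
gap = max(abs(r_y_time_domain(rho, k) - r_y_closed_form(rho, k)) for k in range(-3, 4))
print(gap)
```

Note that $R_Y(-k) = R_Y(k)^*$ in both forms, as required of an autocorrelation function.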

Suppose that $\mathcal{H}(z)$ is a rational function of $z$, meaning that it is a ratio of two polynomials
of $z$ with complex coefficients. We assume that the numerator and denominator have no zeros
in common, and that neither has a root on the unit circle. The function $\mathcal{H}$ is positive type (the
z-transform of a causal function) if its poles (the zeros of its denominator polynomial) are inside
the unit circle in the complex plane. If $\mathcal{H}$ is positive type and if its zeros are also inside the unit
circle, then $h$ and $\mathcal{H}$ are said to be minimum phase functions (in the time domain and z-transform
domain, respectively). A positive-type, minimum phase function $\mathcal{H}$ has the property that both
$\mathcal{H}$ and its inverse $1/\mathcal{H}$ are causal functions. Two linear time-invariant systems in series, one with
transfer function $\mathcal{H}$ and one with transfer function $1/\mathcal{H}$, pass all signals unchanged. Thus if $\mathcal{H}$ is
positive type and minimum phase, we say that $\mathcal{H}$ is causal and causally invertible.

Assume that $\mathcal{S}_Y$ corresponds to a WSS random process $Y$ and that $\mathcal{S}_Y$ is a rational function
with no poles or zeros on the unit circle in the complex plane. We shall investigate the symmetries
of $\mathcal{S}_Y$, with an eye towards its factorization. First,

$$R_Y = \tilde{R}_Y \quad \text{so that} \quad \mathcal{S}_Y(z) = \mathcal{S}_Y^*(1/z^*) \tag{9.13}$$

Therefore, if $z_0$ is a pole of $\mathcal{S}_Y$ with $z_0 \ne 0$, then $1/z_0^*$ is also a pole. Similarly, if $z_0$ is a zero of
$\mathcal{S}_Y$ with $z_0 \ne 0$, then $1/z_0^*$ is also a zero of $\mathcal{S}_Y$. These observations imply that $\mathcal{S}_Y$ can be uniquely
factored as

$$\mathcal{S}_Y(z) = \mathcal{S}_Y^+(z) \mathcal{S}_Y^-(z)$$

such that for some constant $\beta > 0$:

• $\mathcal{S}_Y^+$ is a minimum phase, positive type z-transform

• $\mathcal{S}_Y^-(z) = \left( \mathcal{S}_Y^+(1/z^*) \right)^*$

• $\lim_{|z| \to \infty} \mathcal{S}_Y^+(z) = \beta$

There is an additional symmetry if $R_Y$ is real-valued:

$$\mathcal{S}_Y(z) = \sum_{k=-\infty}^{\infty} R_Y(k) z^{-k} = \sum_{k=-\infty}^{\infty} \left( R_Y(k) (z^*)^{-k} \right)^* = \mathcal{S}_Y^*(z^*) \quad \text{(for real-valued } R_Y\text{)} \tag{9.14}$$

Therefore, if $R_Y$ is real and if $z_0$ is a nonzero pole of $\mathcal{S}_Y$, then $z_0^*$ is also a pole. Combining (9.13)
and (9.14) yields that if $R_Y$ is real then the real-valued nonzero poles of $\mathcal{S}_Y$ come in pairs: $z_0$ and
$1/z_0$, and the other nonzero poles of $\mathcal{S}_Y$ come in quadruples: $z_0$, $z_0^*$, $1/z_0$, and $1/z_0^*$. A similar
statement concerning the zeros of $\mathcal{S}_Y$ also holds true. Some example factorizations are as follows
(where $|\rho| < 1$ and $\beta > 0$):

$$\mathcal{S}_Y(z) = \underbrace{\frac{\beta}{1 - \rho/z}}_{\mathcal{S}_Y^+(z)} \; \underbrace{\frac{\beta}{1 - \rho^* z}}_{\mathcal{S}_Y^-(z)}$$

$$\mathcal{S}_Y(z) = \underbrace{\frac{\beta (1 - .8/z)}{(1 - .6/z)(1 - .7/z)}}_{\mathcal{S}_Y^+(z)} \; \underbrace{\frac{\beta (1 - .8z)}{(1 - .6z)(1 - .7z)}}_{\mathcal{S}_Y^-(z)}$$

$$\mathcal{S}_Y(z) = \underbrace{\frac{\beta}{(1 - \rho/z)(1 - \rho^*/z)}}_{\mathcal{S}_Y^+(z)} \; \underbrace{\frac{\beta}{(1 - \rho z)(1 - \rho^* z)}}_{\mathcal{S}_Y^-(z)}$$

An important application of spectral factorization is the generation of a discrete-time WSS
random process with a specified correlation function $R_Y$. The idea is to start with a discrete-time
white noise process $W$ with $R_W(k) = I_{\{k=0\}}$, or equivalently, with $\mathcal{S}_W(z) \equiv 1$, and then pass it
through an appropriate linear, time-invariant system. The appropriate filter is given by taking
$\mathcal{H}(z) = \mathcal{S}_Y^+(z)$, for then the spectral density of the output is indeed given by

$$\mathcal{H}(z) \mathcal{H}^*(1/z^*) \mathcal{S}_W(z) = \mathcal{S}_Y^+(z) \mathcal{S}_Y^-(z) = \mathcal{S}_Y(z)$$

The spectral factorization can be used to solve the causal filtering problem in discrete time.
Arguing just as in the continuous time case, we find that if $X$ and $Y$ are jointly WSS random
processes, then the best estimator of $X_{n+T}$ given $(Y_k : k \le n)$ having the form

$$\hat{X}_{n+T|n} = \sum_{k=-\infty}^{\infty} Y_k h(n - k)$$

for a causal function $h$ is the function $h$ satisfying the Wiener-Hopf equations (9.2) and (9.3), and
the z-transform of the optimal $h$ is given by

$$\mathcal{H} = \frac{1}{\mathcal{S}_Y^+} \left[ \frac{z^T \mathcal{S}_{XY}}{\mathcal{S}_Y^-} \right]_+ \tag{9.15}$$






Finally, an alternative derivation of (9.15) is given, based on the use of a whitening filter. The
idea is the same as the idea of linear innovations sequence considered in Chapter 3. The first step
is to notice that the causal estimation problem is particularly simple if the observation process is
white noise. Indeed, if the observed process $Y$ is white noise with $R_Y(k) = I_{\{k=0\}}$ then for each
$k \ge 0$ the choice of $h(k)$ is simply made to minimize the mean square error when $X_{n+T}$ is estimated
by the single term $h(k) Y_{n-k}$. This gives $h(k) = R_{XY}(T + k) I_{\{k \ge 0\}}$. Another way to get the same
result is to solve the Wiener-Hopf equations (9.2) and (9.3) in discrete time in case $R_Y(k) = I_{\{k=0\}}$.
In general, of course, the observation process $Y$ is not white, but the idea is to replace $Y$ by an
equivalent observation process $Z$ that is white.

Let $Z$ be the result of passing $Y$ through a filter with transfer function $\mathcal{G}(z) = 1/\mathcal{S}_Y^+(z)$. Since
$\mathcal{S}_Y^+(z)$ is a minimum phase function, $\mathcal{G}$ is a positive type function and the system is causal. Thus,
any random variable in the m.s. closure of the linear span of $(Z_k : k \le n)$ is also in the m.s. closure
of the linear span of $(Y_k : k \le n)$. Conversely, since $Y$ can be recovered from $Z$ by passing $Z$
through the causal linear time-invariant system with transfer function $\mathcal{S}_Y^+(z)$, any random variable
in the m.s. closure of the linear span of $(Y_k : k \le n)$ is also in the m.s. closure of the linear span
of $(Z_k : k \le n)$. Hence, the optimal causal linear estimator of $X_{n+T}$ based on $(Y_k : k \le n)$ is equal
to the optimal causal linear estimator of $X_{n+T}$ based on $(Z_k : k \le n)$. By the previous paragraph,
such an estimator is obtained by passing $Z$ through the linear time-invariant system with impulse
response function $R_{XZ}(T + k) I_{\{k \ge 0\}}$, which has z-transform $[z^T \mathcal{S}_{XZ}]_+$. See Figure 9.4.

[Figure 9.4: Optimal filtering based on whitening first. $Y$ is passed through $1/\mathcal{S}_Y^+(z)$ to produce $Z$, which is passed through $[z^T \mathcal{S}_{XZ}(z)]_+$ to produce $\hat{X}_{t+T|t}$.]

The transfer function for two linear, time-invariant systems in series is the product of their
z-transforms. In addition,

$$\mathcal{S}_{XZ}(z) = \mathcal{G}^*(1/z^*) \mathcal{S}_{XY}(z) = \frac{\mathcal{S}_{XY}(z)}{\mathcal{S}_Y^-(z)}$$

Hence, the series system shown in Figure 9.4 is indeed equivalent to passing $Y$ through the linear
time invariant system with $\mathcal{H}(z)$ given by (9.15).



Example 9.5.2 Suppose that $X$ and $N$ are discrete-time mean zero WSS random processes such
that $R_{XN} = 0$. Suppose $\mathcal{S}_X(z) = \frac{1}{(1 - \rho/z)(1 - \rho z)}$ where $0 < \rho < 1$, and suppose that $N$ is a discrete-time
white noise with $\mathcal{S}_N(z) \equiv \sigma^2$ and $R_N(k) = \sigma^2 I_{\{k=0\}}$. Let the observed process $Y$ be given
by $Y = X + N$. Let us find the minimum mean square error linear estimator of $X_n$ based on
$(Y_k : k \le n)$. We begin by factoring $\mathcal{S}_Y$.

$$\mathcal{S}_Y(z) = \mathcal{S}_X(z) + \mathcal{S}_N(z) = \frac{z}{(z - \rho)(1 - \rho z)} + \sigma^2 = \frac{-\sigma^2 \rho \left\{ z^2 - \left( \frac{1 + \rho^2}{\rho} + \frac{1}{\sigma^2 \rho} \right) z + 1 \right\}}{(z - \rho)(1 - \rho z)}$$

The quadratic expression in braces can be expressed as $(z - z_0)(z - 1/z_0)$, where $z_0$ is the smaller
root of the expression in braces, yielding the factorization

$$\mathcal{S}_Y(z) = \underbrace{\frac{\beta (1 - z_0/z)}{(1 - \rho/z)}}_{\mathcal{S}_Y^+(z)} \; \underbrace{\frac{\beta (1 - z_0 z)}{(1 - \rho z)}}_{\mathcal{S}_Y^-(z)} \quad \text{where } \beta^2 = \frac{\sigma^2 \rho}{z_0}$$

Using the fact $\mathcal{S}_{XY} = \mathcal{S}_X$, and appealing to a partial fraction expansion yields

$$\frac{\mathcal{S}_{XY}(z)}{\mathcal{S}_Y^-(z)} = \frac{1}{\beta (1 - \rho/z)(1 - z_0 z)} = \frac{1}{\beta (1 - \rho/z)(1 - z_0 \rho)} + \frac{z}{\beta ((1/z_0) - \rho)(1 - z_0 z)} \tag{9.16}$$

The first term in (9.16) is positive type, and the second term in (9.16) is the z-transform of a
function that is supported on the negative integers. Thus, the first term is equal to
$\left[ \frac{\mathcal{S}_{XY}}{\mathcal{S}_Y^-} \right]_+$.
Finally, dividing by $\mathcal{S}_Y^+$ yields that the z-transform of the optimal filter is given by

$$\mathcal{H}(z) = \frac{1}{\beta^2 (1 - z_0 \rho)(1 - z_0/z)}$$

or in the time domain

$$h(n) = \frac{z_0^n I_{\{n \ge 0\}}}{\beta^2 (1 - z_0 \rho)}$$
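The recipe of this example is short enough to code directly. The sketch below computes $z_0$ as the smaller root of $z^2 - bz + 1$ with $b = \frac{1+\rho^2}{\rho} + \frac{1}{\sigma^2\rho}$, then $\beta^2 = \sigma^2\rho/z_0$ and the filter gain $1/(\beta^2(1 - z_0\rho))$, and checks the result against the discrete-time Wiener-Hopf equation $R_{XY}(k) = \sum_j h(j) R_Y(k-j)$ for $k \ge 0$ (the case $T = 0$). It uses $R_X(k) = \rho^{|k|}/(1-\rho^2)$, which follows from Example 9.5.1 for real $\rho$. The function names, truncation length, and parameter values are illustrative assumptions.

```python
import math

def causal_wiener_ar1(rho, sigma2):
    """Return (z0, beta2, gain) such that h(n) = gain * z0**n for n >= 0."""
    b = (1 + rho ** 2) / rho + 1 / (sigma2 * rho)
    z0 = (b - math.sqrt(b * b - 4)) / 2      # smaller root of z^2 - b z + 1
    beta2 = sigma2 * rho / z0
    gain = 1 / (beta2 * (1 - z0 * rho))
    return z0, beta2, gain

def wiener_hopf_gap(rho, sigma2, k, terms=200):
    """R_XY(k) - sum_j h(j) R_Y(k - j) for k >= 0, with the sum truncated."""
    z0, _, gain = causal_wiener_ar1(rho, sigma2)
    r_x = lambda m: rho ** abs(m) / (1 - rho ** 2)
    r_y = lambda m: r_x(m) + (sigma2 if m == 0 else 0.0)
    return r_x(k) - sum(gain * z0 ** j * r_y(k - j) for j in range(terms))

z0, beta2, gain = causal_wiener_ar1(rho=0.5, sigma2=1.0)
print(z0, beta2, gain)
```

Because $0 < z_0 < 1$, the filter $h(n) = \text{gain} \cdot z_0^n$ can be run recursively as $\hat{X}_n = z_0 \hat{X}_{n-1} + \text{gain} \cdot Y_n$, which is one reason the causal solution is convenient in practice.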



9.6 Problems 

9.1 A quadratic predictor 

Suppose $X$ is a mean zero, stationary discrete-time random process and that $n$ is an integer with
$n \ge 1$. Consider estimating $X_{n+1}$ by a nonlinear one-step predictor of the form

$$\hat{X}_{n+1} = h_0 + \sum_{k=1}^{n} h_1(k) X_k + \sum_{j=1}^{n} \sum_{k=1}^{j} h_2(j,k) X_j X_k$$

(a) Find equations in terms of the moments (second and higher, if needed) of $X$ for the triple
$(h_0, h_1, h_2)$ to minimize the one step prediction error: $E[(X_{n+1} - \hat{X}_{n+1})^2]$.

(b) Explain how your answer to part (a) simplifies if $X$ is a Gaussian random process.

9.2 A smoothing problem 

Suppose $X$ and $Y$ are mean zero, second order random processes in continuous time. Suppose the
MMSE estimator of $X_5$ is to be found based on observation of $(Y_u : u \in [0,3] \cup [7,10])$. Assuming the
estimator takes the form of an integral, derive the optimality conditions that must be satisfied by
the kernel function (the function that $Y$ is multiplied by before integrating). Use the orthogonality
principle.

9.3 A simple, noncausal estimation problem 

Let $X = (X_t : t \in \mathbb{R})$ be a real valued, stationary Gaussian process with mean zero and
autocorrelation function $R_X(\tau) = A^2 \mathrm{sinc}(f_o \tau)$, where $A$ and $f_o$ are positive constants. Let $N = (N_t : t \in \mathbb{R})$
be a real valued Gaussian white noise process with $R_N(\tau) = \sigma^2 \delta(\tau)$, which is independent of $X$.
Define the random process $Y = (Y_t : t \in \mathbb{R})$ by $Y_t = X_t + N_t$. Let $\hat{X}_t = \int_{-\infty}^{\infty} h(t-s) Y_s \, ds$, where
the impulse response function $h$, which can be noncausal, is chosen to minimize $E[D_t^2]$ for each $t$,
where $D_t = X_t - \hat{X}_t$. (a) Find $h$. (b) Identify the probability distribution of $D_t$, for $t$ fixed. (c)
Identify the conditional distribution of $D_t$ given $Y_t$, for $t$ fixed. (d) Identify the autocorrelation
function, $R_D$, of the error process $D$, and the cross correlation function, $R_{DY}$.

9.4 Interpolating a Gauss Markov process 

Let $X$ be a real-valued, mean zero stationary Gaussian process with $R_X(\tau) = e^{-|\tau|}$. Let $a > 0$.
Suppose $X_0$ is estimated by $\hat{X}_0 = c_1 X_{-a} + c_2 X_a$ where the constants $c_1$ and $c_2$ are chosen to
minimize the mean square error (MSE).

(a) Use the orthogonality principle to find $c_1$, $c_2$, and the resulting minimum MSE, $E[(X_0 - \hat{X}_0)^2]$.
(Your answers should depend only on $a$.)

(b) Use the orthogonality principle again to show that $\hat{X}_0$ as defined above is the minimum MSE
estimator of $X_0$ given $(X_s : |s| \ge a)$. (This implies that $X$ has a two-sided Markov property.)

9.5 Estimation of a filtered narrowband random process in noise 

Suppose $X$ is a mean zero real-valued stationary Gaussian random process with the spectral density
shown.

(a) Explain how $X$ can be simulated on a computer using a pseudo-random number generator that
generates standard normal random variables. Try to use the minimum number per unit time. How
many normal random variables does your construction require per simulated unit time?

(b) Suppose $X$ is passed through a linear time-invariant system with approximate transfer function
$H(2\pi f) = 10^7/(10^7 + f^2)$. Find an approximate numerical value for the power of the output.

(c) Let $Z_t = X_t + W_t$ where $W$ is a Gaussian white noise random process, independent of $X$, with
$R_W(\tau) = \delta(\tau)$. Find $h$ to minimize the mean square error $E[(X_t - \hat{X}_t)^2]$, where $\hat{X} = h * Z$.

(d) Find the mean square error for the estimator of part (c).

9.6 Proportional noise 

Suppose $X$ and $N$ are second order, mean zero random processes such that $R_{XN} = 0$, and let
$Y = X + N$. Suppose the correlation functions $R_X$ and $R_N$ are known, and that $R_N = \gamma^2 R_X$
for some nonnegative constant $\gamma^2$. Consider the problem of estimating $X_t$ using a linear estimator
based on $(Y_u : a \le u \le b)$, where $a$, $b$, and $t$ are given times with $a < b$.

(a) Use the orthogonality principle to show that if $t \in [a,b]$, then the optimal estimator is given by
$\hat{X}_t = \kappa Y_t$ for some constant $\kappa$, and identify the constant $\kappa$ and the corresponding MSE.

(b) Suppose in addition that $X$ and $N$ are WSS and that $X_{t+T}$ is to be estimated from $(Y_s : s \le t)$.
Show how the equation for the optimal causal filter reduces to your answer to part (a) in case
$T \le 0$.

(c) Continue under the assumptions of part (b), except consider $T > 0$. How is the optimal filter
for estimating $X_{t+T}$ from $(Y_s : s \le t)$ related to the problem of predicting $X_{t+T}$ from $(X_s : s \le t)$?

9.7 Predicting the future of a simple WSS process 

Let $X$ be a mean zero, WSS random process with power spectral density $S_X(\omega) = \frac{1}{\omega^4 + 13\omega^2 + 36}$.

(a) Find the positive type, minimum phase rational function $S_X^+$ such that $S_X(\omega) = |S_X^+(\omega)|^2$.

(b) Let $T$ be a fixed known constant with $T \ge 0$. Find $\hat{X}_{t+T|t}$, the MMSE linear estimator of $X_{t+T}$
given $(X_s : s \le t)$. Be as explicit as possible. (Hint: Check that your answer is correct in case
$T = 0$ and in case $T \to \infty$.)

(c) Find the MSE for the optimal estimator of part (b). 




9.8 Short answer filtering questions 

(a) Prove or disprove: If $H$ is a positive type function then so is $H^2$. (b) Prove or disprove: Suppose
$X$ and $Y$ are jointly WSS, mean zero random processes with continuous spectral densities such that
$S_X(2\pi f) = 0$ unless $|f| \in$ [9012 MHz, 9015 MHz] and $S_Y(2\pi f) = 0$ unless $|f| \in$ [9022 MHz, 9025
MHz]. Then the best linear estimate of $X_0$ given $(Y_t : t \in \mathbb{R})$ is 0. (c) Let $H(2\pi f) = \mathrm{sinc}(f)$. Find
$[H]_+$.

9.9 On the MSE for causal estimation 

Recall that if $X$ and $Y$ are jointly WSS and have power spectral densities, and if $S_Y$ is rational with
a spectral factorization, then the mean square error for linear estimation of $X_{t+T}$ using $(Y_s : s \le t)$
is given by

$$(\mathrm{MSE}) = R_X(0) - \int_{-\infty}^{\infty} \left| \left[ \frac{e^{j\omega T} S_{XY}}{S_Y^-} \right]_+ \right|^2 \frac{d\omega}{2\pi}.$$

Evaluate and interpret the limits of this expression as $T \to -\infty$ and as $T \to \infty$.

9.10 A singular estimation problem 

Let $X_t = A e^{j 2\pi f_o t}$, where $f_o > 0$ and $A$ is a mean zero complex valued random variable with
$E[A^2] = 0$ and $E[|A|^2] = \sigma_A^2$. Let $N$ be a white noise process with $R_N(\tau) = \sigma_N^2 \delta(\tau)$. Let
$Y_t = X_t + N_t$. Let $\hat{X}$ denote the output process when $Y$ is filtered using the impulse response
function $h(\tau) = \alpha e^{-(\alpha - j 2\pi f_o)\tau} I_{\{\tau \ge 0\}}$.

(a) Verify that $X$ is a WSS periodic process, and find its power spectral density (the power spectral
density only exists as a generalized function, i.e. there is a delta function in it).

(b) Give a simple expression for the output of the linear system when the input is $X$.

(c) Find the mean square error, $E[|X_t - \hat{X}_t|^2]$. How should the parameter $\alpha$ be chosen to approximately
minimize the MSE?

9.11 Filtering a WSS signal plus noise 

Suppose $X$ and $N$ are jointly WSS, mean zero, continuous time random processes with $R_{XN} = 0$.
The processes are the inputs to a system with the block diagram shown, for some transfer functions
$K_1(\omega)$ and $K_2(\omega)$:



[Block diagram: inputs $X$ and $N$, subsystems $K_1$ and $K_2$, output $Y = X_{out} + N_{out}$.]



Suppose that for every value of $\omega$, $K_i(\omega) \ne 0$ for $i = 1$ and $i = 2$. Because the two subsystems are
linear, we can view the output process $Y$ as the sum of two processes, $X_{out}$, due to the input $X$,
plus $N_{out}$, due to the input $N$. Your answers to the first four parts should be expressed in terms
of $K_1$, $K_2$, and the power spectral densities $S_X$ and $S_N$.

(a) What is the power spectral density $S_Y$?

(b) Find the signal-to-noise ratio at the output (the power of $X_{out}$ divided by the power of $N_{out}$).

(c) Suppose $Y$ is passed into a linear system with transfer function $H$, designed so that the output
at time $t$ is $\hat{X}_t$, the best linear estimator of $X_t$ given $(Y_s : s \in \mathbb{R})$. Find $H$.

(d) Find the resulting minimum mean square error.

(e) The correct answer to part (d) (the minimum MSE) does not depend on the filter $K_2$. Why?

9.12 A prediction problem 

Let $X$ be a mean zero WSS random process with correlation function $R_X(\tau) = e^{-|\tau|}$. Using the
Wiener filtering equations, find the optimal linear MMSE estimator (i.e. predictor) of $X_{t+T}$ based
on $(X_s : s \le t)$, for a constant $T > 0$. Explain why your answer takes such a simple form.

9.13 Properties of a particular Gaussian process 

Let $X$ be a zero-mean, wide-sense stationary Gaussian random process in continuous time with
autocorrelation function $R_X(\tau) = (1 + |\tau|) e^{-|\tau|}$ and power spectral density $S_X(\omega) = (2/(1 + \omega^2))^2$.
Answer the following questions, being sure to provide justification.

(a) Is $X$ mean ergodic in the m.s. sense?

(b) Is $X$ a Markov process?

(c) Is $X$ differentiable in the m.s. sense?

(d) Find the causal, minimum phase filter $h$ (or its transform $H$) such that if white noise with
autocorrelation function $\delta(\tau)$ is filtered using $h$ then the output autocorrelation function is $R_X$.

(e) Express $X$ as the solution of a stochastic differential equation driven by white noise.

9.14 Spectral decomposition and factorization 

(a) Let $x$ be the signal with Fourier transform given by $\hat{x}(2\pi f) = [\mathrm{sinc}(100 f) e^{j 2\pi f T}]_+$. Find the
energy of $x$ for all real values of the constant $T$.

(b) Find the spectral factorization of the power spectral density S(ui) = 4 16 2 .inn - (Hint: 1 + 3j 
is a pole of S.) 

9.15 A continuous-time Wiener filtering problem 

Let $(X_t)$ and $(N_t)$ be uncorrelated, mean zero random processes with $R_X(t) = \exp(-2|t|)$ and
$S_N(\omega) = N_o/2$ for a positive constant $N_o$. Suppose that $Y_t = X_t + N_t$.

(a) Find the optimal (noncausal) filter for estimating $X_t$ given $(Y_s : -\infty < s < +\infty)$ and find the
resulting mean square error. Comment on how the MMSE depends on $N_o$.

(b) Find the optimal causal filter with lead time $T$, that is, the Wiener filter for estimating $X_{t+T}$
given $(Y_s : -\infty < s \le t)$, and find the corresponding MMSE. For simplicity you can assume that
$T \ge 0$. Comment on the limiting value of the MMSE as $T \to \infty$, as $N_o \to \infty$, or as $N_o \to 0$.

9.16 Estimation of a random signal, using the KL expansion 

Suppose that $X$ is a m.s. continuous, mean zero process over an interval $[a, b]$, and suppose $N$ is
a white noise process, with $R_{XN} = 0$ and $R_N(s,t) = \sigma^2 \delta(s - t)$. Let $(\phi_k : k \ge 1)$ be a complete
orthonormal basis for $L^2[a, b]$ consisting of eigenfunctions of $R_X$, and let $(\lambda_k : k \ge 1)$ denote the
corresponding eigenvalues. Suppose that $Y = (Y_t : a \le t \le b)$ is observed.

(a) Fix an index $i$. Express the MMSE estimator of $(X, \phi_i)$ given $Y$ in terms of the coordinates,
$(Y, \phi_1), (Y, \phi_2), \ldots$ of $Y$, and find the corresponding mean square error.

(b) Now suppose $f$ is a function in $L^2[a,b]$. Express the MMSE estimator of $(X, f)$ given $Y$ in
terms of the coordinates $((f, \phi_j) : j \ge 1)$ of $f$, the coordinates of $Y$, the $\lambda$'s, and $\sigma$. Also, find the
mean square error.

9.17 Noiseless prediction of a baseband random process

Fix positive constants $T$ and $\omega_o$, suppose $X = (X_t : t \in \mathbb{R})$ is a baseband random process with
one-sided frequency limit $\omega_o$, and let $H^{(n)}(\omega) = \sum_{k=0}^{n} \frac{(j\omega T)^k}{k!}$, which is a partial sum of the power
series of $e^{j\omega T}$. Let $\hat{X}^{(n)}_{t+T|t}$ denote the output at time $t$ when $X$ is passed through the linear time
invariant system with transfer function $H^{(n)}$. As the notation suggests, $\hat{X}^{(n)}_{t+T|t}$ is an estimator (not
necessarily optimal) of $X_{t+T}$ given $(X_s : s \le t)$.

(a) Describe $\hat{X}^{(n)}_{t+T|t}$ in terms of $X$ in the time domain. Verify that the linear system is causal.

(b) Show that $\lim_{n \to \infty} a_n = 0$, where $a_n = \max_{|\omega| \le \omega_o} |e^{j\omega T} - H^{(n)}(\omega)|$. (This means that the power
series converges uniformly for $\omega \in [-\omega_o, \omega_o]$.)

(c) Show that the mean square error can be made arbitrarily small by taking $n$ sufficiently large.
In other words, show that $\lim_{n \to \infty} E[|X_{t+T} - \hat{X}^{(n)}_{t+T|t}|^2] = 0$.

(d) Thus, the future of a narrowband random process X can be predicted perfectly from its past. 
What is wrong with the following argument for general WSS processes? If X is an arbitrary WSS 
random process, we could first use a bank of (infinitely many) narrowband filters to split X into 
an equivalent set of narrowband random processes (call them "subprocesses") which sum to X. By 
the above, we can perfectly predict the future of each of the subprocesses from its past. So adding 
together the predictions, would yield a perfect prediction of X from its past. 

9.18 Linear innovations and spectral factorization 

Suppose $X$ is a discrete time WSS random process with mean zero. Suppose that the z-transform
version of its power spectral density has the factorization as described in the notes:
$\mathcal{S}_X(z) = \mathcal{S}_X^+(z) \mathcal{S}_X^-(z)$ such that $\mathcal{S}_X^+(z)$ is a minimum phase, positive type function, $\mathcal{S}_X^-(z) = (\mathcal{S}_X^+(1/z^*))^*$,
and $\lim_{|z| \to \infty} \mathcal{S}_X^+(z) = \beta$ for some $\beta > 0$. The linear innovations sequence of $X$ is the sequence $\tilde{X}$
such that $\tilde{X}_k = X_k - \hat{X}_{k|k-1}$, where $\hat{X}_{k|k-1}$ is the MMSE predictor of $X_k$ given $(X_l : l \le k - 1)$.
Note that there is no constant multiplying $X_k$ in the definition of $\tilde{X}_k$. You should use $\mathcal{S}_X^+(z)$, $\mathcal{S}_X^-(z)$,
and/or $\beta$ in giving your answers.

(a) Show that $\tilde{X}$ can be obtained by passing $X$ through a linear time-invariant filter, and identify
the corresponding value of $\mathcal{H}$.

(b) Identify the mean square prediction error, $E[|X_k - \hat{X}_{k|k-1}|^2]$.

9.19 A singular nonlinear estimation problem 

Suppose $X$ is a standard Brownian motion with parameter $\sigma^2 = 1$ and suppose $N$ is a Poisson
random process with rate $\lambda = 10$, which is independent of $X$. Let $Y = (Y_t : t \ge 0)$ be defined by
$Y_t = X_t + N_t$.

(a) Find the optimal estimator of $X_1$ among the estimators that are linear functions of
$(Y_t : 0 \le t \le 1)$ and the constants, and find the corresponding mean square error. Your estimator can include a
constant plus a linear combination, or limits of linear combinations, of $Y_t : 0 \le t \le 1$. (Hint: There
is a closely related problem elsewhere in this problem set.)

(b) Find the optimal possibly nonlinear estimator of $X_1$ given $(Y_t : 0 \le t \le 1)$, and find the
corresponding mean square error. (Hint: No computation is needed. Draw sample paths of the
processes.)

9.20 A discrete-time Wiener filtering problem

Extend the discrete-time Wiener filtering problem considered at the end of the notes to incorporate
a lead time $T$. Assume $T$ to be integer valued. Identify the optimal filter in both the z-transform
domain and in the time domain. (Hint: Treat the case $T \le 0$ separately. You need not identify the
covariance of error.)

9.21 Causal estimation of a channel input process 

Let $X = (X_t : t \in \mathbb{R})$ and $N = (N_t : t \in \mathbb{R})$ denote WSS random processes with $R_X(\tau) = \frac{3}{2} e^{-|\tau|}$
and $R_N(\tau) = \delta(\tau)$. Think of $X$ as an input signal and $N$ as noise, and suppose $X$ and $N$ are
orthogonal to each other. Let $k$ denote the impulse response function given by $k(\tau) = 2 e^{-3\tau} I_{\{\tau \ge 0\}}$,
and suppose an output process $Y$ is generated according to the block diagram shown:

[Block diagram: $X$ is passed through the filter $k$, and $N$ is added to the result to give $Y$.]

That is, $Y = X * k + N$. Suppose $X_t$ is to be estimated by passing $Y$ through a causal filter with
impulse response function $h$, and transfer function $H$. Find the choice of $H$ and $h$ in order to
minimize the mean square error.

9.22 Estimation given a strongly correlated process 

Suppose $g$ and $k$ are minimum phase causal functions in discrete-time, with $g(0) = k(0) = 1$, and
z-transforms $\mathcal{G}$ and $\mathcal{K}$. Let $W = (W_k : k \in \mathbb{Z})$ be a mean zero WSS process with $S_W(\omega) = 1$, let
$X_n = \sum_{i=-\infty}^{\infty} g(n - i) W_i$ and $Y_n = \sum_{i=-\infty}^{\infty} k(n - i) W_i$.

(a) Express $R_X$, $R_Y$, $R_{XY}$, $\mathcal{S}_X$, $\mathcal{S}_Y$, and $\mathcal{S}_{XY}$ in terms of $g$, $k$, $\mathcal{G}$, $\mathcal{K}$.

(b) Find $h$ so that $\hat{X}_{n|n} = \sum_{i=-\infty}^{\infty} Y_i h(n - i)$ is the MMSE linear estimator of $X_n$ given $(Y_i : i \le n)$.

(c) Find the resulting mean square error. Give an intuitive reason for your answer.

9.23 Estimation of a process with raised cosine spectrum 

Suppose $Y = X + N$, where $X$ and $N$ are independent, mean zero, WSS random processes with

$$S_X(\omega) = \frac{1 + \cos(\pi\omega/\omega_o)}{2} I_{\{|\omega| \le \omega_o\}} \quad \text{and} \quad S_N(\omega) = \frac{N_o}{2},$$

where $N_o > 0$ and $\omega_o > 0$. (a) Find the transfer function $H$ for the filter such that if the input
process is $Y$, the output process, $\hat{X}$, is such that $\hat{X}_t$ is the optimal linear estimator of $X_t$ based on
$(Y_s : s \in \mathbb{R})$.

(b) Express the mean square error, $\sigma_e^2 = E[(X_t - \hat{X}_t)^2]$, as an integral in the frequency domain.
(You needn't carry out the integration.)

(c) Describe the limits of your answers to (a) and (b) as $N_o \to 0$.

(d) Describe the limits of your answers to (a) and (b) as $N_o \to \infty$.

9.24 * Resolution of Wiener and Kalman filtering 

Consider the state and observation models: 

$$X_n = F X_{n-1} + W_n$$
$$Y_n = H^T X_n + V_n$$

where $(W_n : -\infty < n < +\infty)$ and $(V_n : -\infty < n < +\infty)$ are independent vector-valued random
sequences of independent, identically distributed mean zero random variables. Let $\Sigma_W$ and $\Sigma_V$
denote the respective covariance matrices of $W_n$ and $V_n$. ($F$, $H$ and the covariance matrices must
satisfy a stability condition. Can you find it?) (a) What are the autocorrelation function $R_X$ and
crosscorrelation function $R_{XY}$?

(b) Use the orthogonality principle to derive conditions for the causal filter $h$ that minimizes
$E[\| X_{n+1} - \sum_{j=0}^{\infty} h(j) Y_{n-j} \|^2]$ (i.e. derive the basic equations for the Wiener-Hopf method).

(c) Write down and solve the equations for the Kalman predictor in steady state to derive an
expression for $h$, and verify that it satisfies the orthogonality conditions.



Chapter 10 

Martingales 



10.1 Conditional expectation revisited 

The general definition of a martingale requires the general definition of conditional expectation. We
begin by reviewing the definitions we have given so far. In Chapter 1 we reviewed the following
elementary definition of $E[X|Y]$. If $X$ and $Y$ are both discrete random variables,

$$E[X|Y = i] = \sum_j j P\{X = j | Y = i\},$$

which is well defined if $P\{Y = i\} > 0$ and either the sum restricted to $j > 0$ or to $j < 0$ is
convergent. That is, $E[X|Y = i]$ is the mean of the conditional pmf of $X$ given $Y = i$. Note that
$g(i) = E[X|Y = i]$ is a function of $i$. Let $E[X|Y]$ be the random variable defined by $E[X|Y] = g(Y)$.
Similarly, if $X$ and $Y$ have a joint pdf, $E[X|Y = y] = \int x f_{X|Y}(x|y)\,dx = g(y)$ and $E[X|Y] = g(Y)$.

Chapter 3 showed that $E[X|Y]$ could be defined whenever $E[X^2] < \infty$, even if $X$ and $Y$ are
neither discrete random variables nor have a joint pdf. The definition is based on a projection,
characterized by the orthogonality principle. Specifically, if $E[X^2] < \infty$, then $E[X|Y]$ is the unique
random variable such that:

• it has the form $g(Y)$ for some (Borel measurable) function $g$ such that $E[(g(Y))^2] < \infty$, and

• $E[(X - E[X|Y]) f(Y)] = 0$ for all (Borel measurable) functions $f$ such that $E[(f(Y))^2] < \infty$.

That is, $E[X|Y]$ is an unconstrained estimator based on $Y$, such that the error $X - E[X|Y]$ is
orthogonal to all other unconstrained estimators based on $Y$. By the orthogonality principle, $E[X|Y]$
exists and is unique, if differences on a set of probability zero are ignored. This second definition
of $E[X|Y]$ is more general than the elementary definition, because it doesn't require $X$ and $Y$ to
be discrete or to have a joint pdf, but it is less general because it requires that $E[X^2] < \infty$.
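The two definitions reviewed above can be compared on a small discrete example. The sketch below is illustrative only: the joint pmf `pmf` and the helper names `cond_mean` and `orthogonality_gap` are made up for this example. It computes $g(i) = E[X|Y = i]$ from the elementary definition and then checks that the error $X - g(Y)$ is orthogonal to a function of $Y$, as the projection-based definition requires.

```python
from fractions import Fraction

# Hypothetical joint pmf of (X, Y): keys are (x, y), values are probabilities.
pmf = {
    (0, 0): Fraction(1, 4), (1, 0): Fraction(1, 4),
    (0, 1): Fraction(1, 8), (2, 1): Fraction(3, 8),
}

def cond_mean(pmf):
    """Return g with g[i] = E[X | Y = i] = sum_j j P{X = j | Y = i}."""
    p_y, num = {}, {}
    for (x, y), p in pmf.items():
        p_y[y] = p_y.get(y, 0) + p        # marginal pmf of Y
        num[y] = num.get(y, 0) + x * p    # sum_j j P{X = j, Y = i}
    return {y: num[y] / p_y[y] for y in p_y}

g = cond_mean(pmf)

def orthogonality_gap(pmf, g, f):
    """E[(X - g(Y)) f(Y)], which should vanish for every function f."""
    return sum((x - g[y]) * f(y) * p for (x, y), p in pmf.items())

print(g)
print(orthogonality_gap(pmf, g, lambda y: 3 * y * y - 1))
```

Exact rational arithmetic is used so that the orthogonality check is an equality rather than a floating point approximation.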

The definition of $E[X|Y]$ given next generalizes the previously given definition in two ways.
First, the definition applies as long as $E[|X|] < \infty$, which is a weaker requirement than $E[X^2] < \infty$.
Second, the definition is based on having information represented by a $\sigma$-algebra, rather than by a
random variable. Recall that, by definition, a $\sigma$-algebra $\mathcal{D}$ for a set $\Omega$ is a set of subsets of $\Omega$ such
that:

(a) $\Omega \in \mathcal{D}$,

(b) if $A \in \mathcal{D}$ then $A^c \in \mathcal{D}$,

(c) if $A, B \in \mathcal{D}$ then $A \cup B \in \mathcal{D}$, and more generally, if $A_1, A_2, \ldots$ is such that $A_i \in \mathcal{D}$ for $i \ge 1$, then
$\cup_{i=1}^{\infty} A_i \in \mathcal{D}$.

In particular, the set of events, $\mathcal{F}$, in a probability space $(\Omega, \mathcal{F}, P)$, is required to be a $\sigma$-algebra.
The original motivation for introducing $\mathcal{F}$ in this context was a technical one, related to the
impossibility of extending $P$ to be defined on all subsets of $\Omega$, for important examples such as
$\Omega = [0,1]$ and $P\{(a,b)\} = b - a$ for all intervals $(a, b)$. However, $\sigma$-algebras are also useful for
representing information available to an observer. We call $\mathcal{D}$ a sub-$\sigma$-algebra of $\mathcal{F}$ if $\mathcal{D}$ is a
$\sigma$-algebra such that $\mathcal{D} \subset \mathcal{F}$. A random variable $Z$ is said to be $\mathcal{D}$-measurable if $\{Z \le c\} \in \mathcal{D}$ for
all $c$. By definition, random variables are functions on $\Omega$ that are $\mathcal{F}$-measurable. The smaller the
$\sigma$-algebra $\mathcal{D}$ is, the smaller the set of $\mathcal{D}$-measurable random variables. In practice, sub-$\sigma$-algebras are
usually generated by collections of random variables:

Definition 10.1.1 The $\sigma$-algebra generated by a collection of random variables $(Y_i : i \in I)$, denoted
by $\sigma(Y_i : i \in I)$, is the smallest $\sigma$-algebra containing all sets of the form $\{Y_i \le c\}$.$^1$ The $\sigma$-algebra
generated by a single random variable $Y$ is denoted by $\sigma(Y)$, and sometimes as $\mathcal{F}^Y$.

An equivalent definition would be that $\sigma(Y_i : i \in I)$ is the smallest $\sigma$-algebra such that each $Y_i$ is
measurable with respect to it.

A sub-$\sigma$-algebra of $\mathcal{F}$ represents knowledge about the probability experiment modeled by the
probability space $(\Omega, \mathcal{F}, P)$. In Chapter 3, the information gained from observing a random variable
$Y$ was modeled by requiring estimators to be random variables of the form $g(Y)$, for a Borel
measurable function $g$. An equivalent condition would be to allow any estimator that is a
$\sigma(Y)$-measurable random variable. That is, as shown in a starred homework problem, if $Y$ and $Z$ are
random variables on the same probability space, then $Z = g(Y)$ for some Borel measurable function
$g$ if and only if $Z$ is $\sigma(Y)$ measurable. Using sub-$\sigma$-algebras is more general, because some $\sigma$-algebras
on some probability spaces are not generated by random variables. Using $\sigma$-algebras to represent
information also works better when there is an uncountably infinite number of observations, such
as observation of a continuous random process over an interval of time. But in engineering practice,
the main difference between the two ways to model information is simply a matter of notation.
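On a finite sample space the equivalence between "$Z = g(Y)$" and "$Z$ is $\sigma(Y)$-measurable" can be checked by brute force. The following sketch assumes a six-point sample space and a made-up random variable `Y`; the helper names `atoms`, `sigma_algebra` and `is_measurable` are hypothetical and chosen only for this illustration. In the finite case, $\sigma(Y)$ consists of all unions of level sets of $Y$.

```python
from itertools import combinations

omega = range(6)
Y = {w: w % 3 for w in omega}          # hypothetical random variable on Omega

def atoms(Y):
    """Level sets {Y = y}; on a finite space they are the atoms of sigma(Y)."""
    level = {}
    for w, y in Y.items():
        level.setdefault(y, set()).add(w)
    return list(level.values())

def sigma_algebra(Y):
    """sigma(Y) on a finite space: all unions of level sets of Y."""
    parts = atoms(Y)
    return {frozenset().union(*c)
            for r in range(len(parts) + 1)
            for c in combinations(parts, r)}

def is_measurable(Z, Y):
    """Z is sigma(Y)-measurable iff every event {Z <= c} lies in sigma(Y)."""
    events = sigma_algebra(Y)
    return all(frozenset(w for w in Z if Z[w] <= c) in events
               for c in set(Z.values()))

Z1 = {w: Y[w] ** 2 + 1 for w in omega}   # Z1 = g(Y) with g(y) = y^2 + 1
Z2 = {w: w for w in omega}               # not a function of Y

print(len(sigma_algebra(Y)), is_measurable(Z1, Y), is_measurable(Z2, Y))
```

Only thresholds $c$ equal to values taken by $Z$ need to be tested, because on a finite space the event $\{Z \le c\}$ changes only at those values.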

Example 10.1.2 (The trivial $\sigma$-algebra) Let $(\Omega, \mathcal{F}, P)$ be a probability space. Suppose $X$ is a
random variable such that, for some constant $c_o$, $X(\omega) = c_o$ for all $\omega \in \Omega$. Then $X$ is measurable
with respect to the trivial $\sigma$-algebra $\mathcal{D}$ defined by $\mathcal{D} = \{\emptyset, \Omega\}$. That is, constant random variables
are $\{\emptyset, \Omega\}$-measurable.

Conversely, suppose $Y$ is a $\{\emptyset, \Omega\}$-measurable random variable. Select an arbitrary $\omega_o \in \Omega$ and
let $c_o = Y(\omega_o)$. On one hand, $\{\omega : Y(\omega) \le c\}$ can't be empty for $c \ge c_o$, so $\{\omega : Y(\omega) \le c\} = \Omega$ for
$c \ge c_o$. On the other hand, $\{\omega : Y(\omega) \le c\}$ doesn't contain $\omega_o$ for $c < c_o$, so $\{\omega : Y(\omega) \le c\} = \emptyset$ for
$c < c_o$. Therefore, $Y(\omega) = c_o$ for all $\omega$. That is, $\{\emptyset, \Omega\}$-measurable random variables are constant.



$^1$ The smallest one exists: it is equal to the intersection of all $\sigma$-algebras which contain all sets of the form $\{Y_i \le c\}$.

Definition 10.1.3 If $X$ is a random variable on $(\Omega, \mathcal{F}, P)$ with finite mean and $\mathcal{D}$ is a
sub-$\sigma$-algebra of $\mathcal{F}$, the conditional expectation of $X$ given $\mathcal{D}$, $E[X|\mathcal{D}]$, is the unique (two versions equal
with probability one are considered to be the same) random variable on $(\Omega, \mathcal{F}, P)$ such that

(i) $E[X|\mathcal{D}]$ is $\mathcal{D}$-measurable

(ii) $E[(X - E[X|\mathcal{D}]) I_D] = 0$ for all $D \in \mathcal{D}$. (Here $I_D$ is the indicator function of $D$.)

We remark that a possible choice of $D$ in property (ii) of the definition is $D = \Omega$, so $E[X|\mathcal{D}]$
should satisfy $E[X - E[X|\mathcal{D}]] = 0$, or equivalently, since $E[X]$ is assumed to be finite, $E[X] =
E[E[X|\mathcal{D}]]$. In particular, an implication of the definition is that $E[X|\mathcal{D}]$ also has a finite mean.
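When $\Omega$ is finite and $\mathcal{D}$ is generated by a partition of $\Omega$, a random variable satisfying (i) and (ii) can be written down directly: on each atom of the partition it equals the probability-weighted average of $X$ over that atom. The sketch below assumes a uniform six-point space and a made-up partition; the names `cond_exp` and `events` are hypothetical helpers, not notation from these notes. It verifies property (ii) for every event in $\mathcal{D}$.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical finite probability space: Omega = {0, ..., 5} with uniform P.
omega = range(6)
P = {w: Fraction(1, 6) for w in omega}
X = {w: w * w for w in omega}            # a random variable on Omega
partition = [{0, 1}, {2, 3, 4}, {5}]     # atoms generating the sub-sigma-algebra D

def cond_exp(X, P, partition):
    """E[X|D] as a function on Omega: on each atom, the P-weighted average of X."""
    out = {}
    for atom in partition:
        avg = sum(X[w] * P[w] for w in atom) / sum(P[w] for w in atom)
        for w in atom:
            out[w] = avg                 # constant on atoms, hence D-measurable
    return out

def events(partition):
    """All events in D: unions of atoms (including the empty union)."""
    for r in range(len(partition) + 1):
        for chosen in combinations(partition, r):
            yield set().union(*chosen)

Z = cond_exp(X, P, partition)
# Property (ii): E[(X - E[X|D]) I_D] = 0 for every D in the sigma-algebra.
gaps = [sum((X[w] - Z[w]) * P[w] for w in D) for D in events(partition)]
print(Z, gaps)
```

Taking $D = \Omega$ in the list of events recovers the identity $E[X] = E[E[X|\mathcal{D}]]$ noted in the remark.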

Proposition 10.1.4 Definition 10.1.3 is well posed. Specifically, there exists a random variable
satisfying conditions (i) and (ii), and it is unique.

Proof. (Uniqueness) Suppose $U$ and $V$ are each $\mathcal{D}$-measurable random variables such that
$E[(X - U) I_D] = 0$ and $E[(X - V) I_D] = 0$ for all $D \in \mathcal{D}$. It follows that
$E[(U - V) I_D] = E[(X - V) I_D] - E[(X - U) I_D] = 0$ for any $D \in \mathcal{D}$. A possible choice of $D$ is $\{U > V\}$, so
$E[(U - V) I_{\{U > V\}}] = 0$. Since $(U - V) I_{\{U > V\}}$ is nonnegative and is strictly positive on the event $\{U > V\}$, it must be
that $P\{U > V\} = 0$. Similarly, $P\{U < V\} = 0$. So $P\{U = V\} = 1$.

(Existence) Existence is first proved under the added assumption that $P\{X \ge 0\} = 1$. Let
$L^2(\mathcal{D})$ be the space of $\mathcal{D}$-measurable random variables with finite second moments. Then $L^2(\mathcal{D})$ is
a closed, linear subspace of $L^2(\Omega, \mathcal{F}, P)$, so the orthogonality principle can be applied. For any
$n \ge 0$, the random variable $X \wedge n$ is bounded and thus has a finite second moment. Let $\hat{X}_n$ be the
projection of $X \wedge n$ onto $L^2(\mathcal{D})$. Then by the orthogonality principle, $X \wedge n - \hat{X}_n$ is orthogonal
to any random variable in $L^2(\mathcal{D})$. In particular, $X \wedge n - \hat{X}_n$ is orthogonal to $I_D$ for any $D \in \mathcal{D}$.
Therefore, $E[(X \wedge n - \hat{X}_n) I_D] = 0$ for all $D \in \mathcal{D}$. Equivalently,

$$E[(X \wedge n) I_D] = E[\hat{X}_n I_D]. \qquad (10.1)$$

The next step is to take a limit as $n \to \infty$. Since $E[(X \wedge n) I_D]$ is nondecreasing in $n$ for each
$D \in \mathcal{D}$, the same is true of $E[\hat{X}_n I_D]$. Thus, for any $n \ge 0$, $E[(\hat{X}_{n+1} - \hat{X}_n) I_D] \ge 0$ for any $D \in \mathcal{D}$.
Taking $D = \{\hat{X}_{n+1} - \hat{X}_n < 0\}$ implies that $P\{\hat{X}_{n+1} \ge \hat{X}_n\} = 1$. Therefore, the sequence $(\hat{X}_n)$
converges a.s., and we denote the limit by $\hat{X}_\infty$. We show that $\hat{X}_\infty$ satisfies the two properties,
(i) and (ii), required of $E[X|\mathcal{D}]$. First, $\hat{X}_\infty$ is $\mathcal{D}$-measurable because it is the limit of a sequence
of $\mathcal{D}$-measurable random variables. Secondly, for any $D \in \mathcal{D}$, the sequences of random variables
$(X \wedge n) I_D$ and $\hat{X}_n I_D$ are a.s. nondecreasing and nonnegative, so by the monotone convergence
theorem (Theorem 11.6.6) and (10.1):

$$E[X I_D] = \lim_{n \to \infty} E[(X \wedge n) I_D] = \lim_{n \to \infty} E[\hat{X}_n I_D] = E[\hat{X}_\infty I_D].$$

So property (ii), $E[(X - \hat{X}_\infty) I_D] = 0$, is also satisfied. Existence is proved in case $P\{X \ge 0\} = 1$.

For the general case, $X$ can be represented as $X = X_+ - X_-$, where $X_+$ and $X_-$ are nonnegative
with finite means. By the case already proved, $E[X_+|\mathcal{D}]$ and $E[X_-|\mathcal{D}]$ exist, and, of course, they
satisfy conditions (i) and (ii) in Definition 10.1.3. Therefore, with $E[X|\mathcal{D}] = E[X_+|\mathcal{D}] - E[X_-|\mathcal{D}]$,
it is a simple matter to check that $E[X|\mathcal{D}]$ also satisfies conditions (i) and (ii), as required. ■



Proposition 10.1.5 Let $X$ and $Y$ be random variables on $(\Omega, \mathcal{F}, P)$ and let $\mathcal{A}$ and $\mathcal{D}$ be
sub-$\sigma$-algebras of $\mathcal{F}$.

1. (Consistency with definition based on projection) If $E[X^2] < \infty$ and
$\mathcal{V} = \{g(Y) : g \text{ is Borel measurable such that } E[g(Y)^2] < \infty\}$, then $E[X|Y]$, defined as the
MMSE projection of $X$ onto $\mathcal{V}$ (also written as $\Pi_{\mathcal{V}}(X)$) is equal to $E[X|\sigma(Y)]$.

2. (Linearity) If $E[X]$ and $E[Y]$ are finite, then $aE[X|\mathcal{D}] + bE[Y|\mathcal{D}] = E[aX + bY|\mathcal{D}]$.

3. (Tower property) If $E[X]$ is finite and $\mathcal{A} \subset \mathcal{D} \subset \mathcal{F}$, then $E[E[X|\mathcal{D}]|\mathcal{A}] = E[X|\mathcal{A}]$. (In
particular, $E[E[X|\mathcal{D}]] = E[X]$.)

4. (Positivity preserving) If $E[X]$ is finite and $X \ge 0$ a.s. then $E[X|\mathcal{D}] \ge 0$ a.s.

5. (L1 contraction property) $E[|E[X|\mathcal{D}]|] \le E[|X|]$.

6. (L1 continuity) If $E[X_n]$ is finite for all $n$ and $E[|X_n - X_\infty|] \to 0$, then
$E[|E[X_n|\mathcal{D}] - E[X_\infty|\mathcal{D}]|] \to 0$.

7. (Pull out property) If $X$ is $\mathcal{D}$-measurable and $E[XY]$ and $E[Y]$ are finite, then
$E[XY|\mathcal{D}] = X E[Y|\mathcal{D}]$.
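Several of these properties can be checked exactly on a finite sample space, where conditioning on a $\sigma$-algebra generated by a partition amounts to averaging over atoms. The sketch below is a numerical sanity check under made-up assumptions (an eight-point space with a non-uniform `P`, and nested partitions `coarse` and `fine` standing in for $\mathcal{A} \subset \mathcal{D}$); it is not a proof and the helper names are hypothetical.

```python
from fractions import Fraction

omega = range(8)
P = {w: Fraction(w + 1, 36) for w in omega}      # non-uniform pmf, sums to 1
fine = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]          # atoms of D
coarse = [{0, 1, 2, 3}, {4, 5, 6, 7}]            # atoms of A, where A is contained in D

def cond_exp(X, partition):
    """Conditional expectation given the sigma-algebra generated by a partition."""
    out = {}
    for atom in partition:
        avg = sum(X[w] * P[w] for w in atom) / sum(P[w] for w in atom)
        out.update({w: avg for w in atom})
    return out

def mean(X):
    return sum(X[w] * P[w] for w in omega)

Y = {w: (-1) ** w * (w + 2) for w in omega}      # an arbitrary random variable
X = {w: w // 2 for w in omega}                   # constant on atoms of D, so D-measurable

# Tower property: E[E[Y|D]|A] = E[Y|A].
tower_ok = cond_exp(cond_exp(Y, fine), coarse) == cond_exp(Y, coarse)

# Pull out property: E[XY|D] = X E[Y|D] for D-measurable X.
EY_D = cond_exp(Y, fine)
XY = {w: X[w] * Y[w] for w in omega}
pull_out_ok = cond_exp(XY, fine) == {w: X[w] * EY_D[w] for w in omega}

# L1 contraction: E[|E[Y|D]|] <= E[|Y|].
contraction_ok = (mean({w: abs(EY_D[w]) for w in omega})
                  <= mean({w: abs(Y[w]) for w in omega}))

print(tower_ok, pull_out_ok, contraction_ok)
```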

Proof. (Consistency with definition based on projection) Suppose $X$ and $\mathcal{V}$ are as in part 1.
Then, by definition, $E[X|Y] \in \mathcal{V}$ and $E[(X - E[X|Y])Z] = 0$ for any $Z \in \mathcal{V}$. As mentioned above,
a random variable has the form $g(Y)$ if and only if it is $\sigma(Y)$-measurable. In particular, $\mathcal{V}$ is simply
the set of $\sigma(Y)$-measurable random variables $Z$ such that $E[Z^2] < \infty$. Thus, $E[X|Y]$ is $\sigma(Y)$
measurable, and $E[(X - E[X|Y])Z] = 0$ for any $\sigma(Y)$-measurable random variable $Z$ such that
$E[Z^2] < \infty$. As a special case, $E[(X - E[X|Y]) I_D] = 0$ for any $D \in \sigma(Y)$. Thus, $E[X|Y]$ satisfies
conditions (i) and (ii) in Definition 10.1.3 of $E[X|\sigma(Y)]$. So $E[X|Y] = E[X|\sigma(Y)]$.

(Linearity Property) (This is similar to the proof of linearity for projections, Proposition 3.2.2.)
It suffices to check that the linear combination $aE[X|\mathcal{D}] + bE[Y|\mathcal{D}]$ satisfies the two conditions that
define $E[aX + bY|\mathcal{D}]$. First, $E[X|\mathcal{D}]$ and $E[Y|\mathcal{D}]$ are both $\mathcal{D}$ measurable, so their linear combination
is also $\mathcal{D}$-measurable. Secondly, if $D \in \mathcal{D}$, then $E[(X - E[X|\mathcal{D}]) I_D] = E[(Y - E[Y|\mathcal{D}]) I_D] = 0$,
from which it follows that

$$E[(aX + bY - (aE[X|\mathcal{D}] + bE[Y|\mathcal{D}])) I_D] = aE[(X - E[X|\mathcal{D}]) I_D] + bE[(Y - E[Y|\mathcal{D}]) I_D] = 0.$$

Therefore, $aE[X|\mathcal{D}] + bE[Y|\mathcal{D}] = E[aX + bY|\mathcal{D}]$.

(Tower Property) (This is similar to the proof of Proposition 3.2.3, about projections onto nested
subspaces.) It suffices to check that $E[E[X|\mathcal{D}]|\mathcal{A}]$ satisfies the two conditions that define $E[X|\mathcal{A}]$.
First, $E[E[X|\mathcal{D}]|\mathcal{A}]$ itself is a conditional expectation given $\mathcal{A}$, so it is $\mathcal{A}$ measurable. Second, let
$D \in \mathcal{A}$. Now $X - E[E[X|\mathcal{D}]|\mathcal{A}] = (X - E[X|\mathcal{D}]) + (E[X|\mathcal{D}] - E[E[X|\mathcal{D}]|\mathcal{A}])$, and (use the fact $D \in \mathcal{D}$, which holds because $\mathcal{A} \subset \mathcal{D}$):
$E[(X - E[X|\mathcal{D}]) I_D] = 0$ and $E[(E[X|\mathcal{D}] - E[E[X|\mathcal{D}]|\mathcal{A}]) I_D] = 0$. Adding these last two equations yields
$E[(X - E[E[X|\mathcal{D}]|\mathcal{A}]) I_D] = 0$. Therefore, $E[E[X|\mathcal{D}]|\mathcal{A}] = E[X|\mathcal{A}]$.

The proofs of linearity and tower property are nearly identical to the proofs of the same prop- 
erties for projections (Propositions 3.2.2 and 3.2.3 ) and are left to the reader. 

(Positivity preserving) Suppose $E[X]$ is finite and $X \ge 0$ a.s. Let $D = \{E[X|\mathcal{D}] < 0\}$. Then
$D \in \mathcal{D}$ because $E[X|\mathcal{D}]$ is $\mathcal{D}$-measurable. So $E[E[X|\mathcal{D}] I_D] = E[X I_D] \ge 0$, while
$P\{E[X|\mathcal{D}] I_D \le 0\} = 1$. Hence, $P\{E[X|\mathcal{D}] I_D = 0\} = 1$, which is to say that $E[X|\mathcal{D}] \ge 0$ a.s.

(L1 contraction property) (This property is a special case of the conditional version of Jensen's
inequality, established in a starred homework problem. Here a different proof is given.) The
variable $X$ can be represented as $X = X_+ - X_-$, where $X_+$ is the positive part of $X$ and $X_-$ is
the negative part of $X$, given by $X_+ = X \vee 0$ and $X_- = (-X) \vee 0$. Since $X$ is assumed to have a
finite mean, the same is true of $X_\pm$. Moreover, $E[E[X_\pm|\mathcal{D}]] = E[X_\pm]$, and by the linearity property,
$E[X|\mathcal{D}] = E[X_+|\mathcal{D}] - E[X_-|\mathcal{D}]$. By the positivity preserving property, $E[X_+|\mathcal{D}]$ and $E[X_-|\mathcal{D}]$ are
both nonnegative a.s., so $E[X_+|\mathcal{D}] + E[X_-|\mathcal{D}] \ge |E[X_+|\mathcal{D}] - E[X_-|\mathcal{D}]|$ a.s. (The inequality is strict
for $\omega$ such that both $E[X_+|\mathcal{D}]$ and $E[X_-|\mathcal{D}]$ are strictly positive.) Therefore,

$$E[|X|] = E[X_+] + E[X_-]$$
$$= E[E[X_+|\mathcal{D}] + E[X_-|\mathcal{D}]]$$
$$\ge E[|E[X_+|\mathcal{D}] - E[X_-|\mathcal{D}]|]$$
$$= E[|E[X|\mathcal{D}]|],$$

and the L1 contraction property is proved.

(L1 continuity) Since for any $n$, $|X_\infty| \le |X_n| + |X_n - X_\infty|$, the hypotheses imply that $X_\infty$
has a finite mean. By linearity and the L1 contraction property, $E[|E[X_n|\mathcal{D}] - E[X_\infty|\mathcal{D}]|] =
E[|E[X_n - X_\infty|\mathcal{D}]|] \le E[|X_n - X_\infty|]$, which implies the L1 continuity property.

(Pull out property) The pull out property will be proved first under the added assumption that
$X$ and $Y$ are nonnegative random variables. Clearly $X E[Y|\mathcal{D}]$ is $\mathcal{D}$ measurable. Let $D \in \mathcal{D}$. It
remains to show that

$$E[XY I_D] = E[X E[Y|\mathcal{D}] I_D]. \qquad (10.2)$$

If $X$ has the form $I_{D_1}$ for $D_1 \in \mathcal{D}$ then (10.2) becomes $E[Y I_{D \cap D_1}] = E[E[Y|\mathcal{D}] I_{D \cap D_1}]$, which
holds by the definition of $E[Y|\mathcal{D}]$ and the fact $D \cap D_1 \in \mathcal{D}$. Equation (10.2) is thus also true if
$X$ is a finite linear combination of random variables of the form $I_{D_1}$, that is, if $X$ is a simple
$\mathcal{D}$-measurable random variable. In general, a nonnegative $\mathcal{D}$-measurable random variable $X$ is the a.s. limit
of a nondecreasing sequence of nonnegative simple $\mathcal{D}$-measurable random variables $X_n$. Now (10.2) holds for $X$ replaced by $X_n$:

$$E[X_n Y I_D] = E[X_n E[Y|\mathcal{D}] I_D]. \qquad (10.3)$$

Also, $X_n Y I_D$ is a nondecreasing sequence converging to $XY I_D$ a.s., and $X_n E[Y|\mathcal{D}] I_D$ is a
nondecreasing sequence converging to $X E[Y|\mathcal{D}] I_D$ a.s. By the monotone convergence theorem, taking
$n \to \infty$ on both sides of (10.3) yields (10.2). This proves the pull out property under the added
assumption that $X$ and $Y$ are nonnegative.

In the general case, $X = X_+ - X_-$, where $X_+ = X \vee 0$ and $X_- = (-X) \vee 0$, and similarly
$Y = Y_+ - Y_-$. The hypotheses imply $E[X_\pm Y_\pm]$ and $E[Y_\pm]$ are finite so that $E[X_\pm Y_\pm|\mathcal{D}] = X_\pm E[Y_\pm|\mathcal{D}]$,
and therefore

$$E[X_\pm Y_\pm I_D] = E[X_\pm E[Y_\pm|\mathcal{D}] I_D], \qquad (10.4)$$

where in these equations, the sign on both appearances of $X$ should be the same, and the sign on
both appearances of $Y$ should be the same. The left side of (10.2) can be expressed as a linear
combination of terms of the form $E[X_\pm Y_\pm I_D]$:

$$E[XY I_D] = E[X_+ Y_+ I_D] - E[X_+ Y_- I_D] - E[X_- Y_+ I_D] + E[X_- Y_- I_D].$$

Similarly, the right side of (10.2) can be expressed as a linear combination of terms of the form
$E[X_\pm E[Y_\pm|\mathcal{D}] I_D]$. Therefore, (10.2) follows from (10.4). ■



10.2 Martingales with respect to filtrations

A filtration of a $\sigma$-algebra $\mathcal{F}$ is a sequence of sub-$\sigma$-algebras $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$ of $\mathcal{F}$, such that
$\mathcal{F}_n \subset \mathcal{F}_{n+1}$ for $n \ge 0$. If $Y = (Y_n : n \ge 0)$ or $Y = (Y_n : n \ge 1)$ is a sequence of random variables on
$(\Omega, \mathcal{F}, P)$, the filtration generated by $Y$, often written as $\boldsymbol{\mathcal{F}}^Y = (\mathcal{F}_n^Y : n \ge 0)$, is defined by letting
$\mathcal{F}_n^Y = \sigma(Y_k : k \le n)$. (If there is no variable $Y_0$ defined, we take $\mathcal{F}_0^Y$ to be the trivial $\sigma$-algebra,
$\mathcal{F}_0^Y = \{\emptyset, \Omega\}$, representing no observations.)

In practice, a filtration represents a sequence of observations or measurements. If the filtration
is generated by a random process, then the information available at time $n$ represents observation
of the random process up to time $n$.

A random process $(X_n : n \ge 0)$ is adapted to a filtration $\boldsymbol{\mathcal{F}}$ if $X_n$ is $\mathcal{F}_n$ measurable for each
$n \ge 0$.

Definition 10.2.1 Let $(\Omega, \mathcal{F}, P)$ be a probability space with a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$. Let
$Y = (Y_n : n \ge 0)$ be a sequence of random variables adapted to $\boldsymbol{\mathcal{F}}$. Then $Y$ is a martingale with
respect to $\boldsymbol{\mathcal{F}}$ if for all $n \ge 0$:

(0) $Y_n$ is $\mathcal{F}_n$ measurable (i.e. the process $Y$ is adapted to $\boldsymbol{\mathcal{F}}$),

(i) $E[|Y_n|] < \infty$,

(ii) $E[Y_{n+1}|\mathcal{F}_n] = Y_n$ a.s.

Similarly, $Y$ is a submartingale relative to $\boldsymbol{\mathcal{F}}$ if (0) and (i) are true and $E[Y_{n+1}|\mathcal{F}_n] \ge Y_n$ a.s.,
and $Y$ is a supermartingale relative to $\boldsymbol{\mathcal{F}}$ if (0) and (i) are true and $E[Y_{n+1}|\mathcal{F}_n] \le Y_n$ a.s.

Some comments are in order. Note that condition (ii) in the definition of a martingale implies
condition (0), because conditional expectations with respect to a $\sigma$-algebra are random variables
measurable with respect to the $\sigma$-algebra.

Note that if $Y = (Y_n : n \ge 0)$ is a martingale with respect to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$,
then $Y$ is also a martingale with respect to the filtration generated by $Y$ itself. Indeed, for each $n$,
$Y_0, \ldots, Y_n$ are all $\mathcal{F}_n$ measurable, whereas $\mathcal{F}_n^Y$ is the smallest $\sigma$-algebra with respect to which $Y_0, \ldots, Y_n$ are measurable,
so $\mathcal{F}_n^Y \subset \mathcal{F}_n$. Therefore, the tower property of conditional expectation, the fact $Y$ is a martingale
with respect to $\boldsymbol{\mathcal{F}}$, and the fact $Y_n$ is $\mathcal{F}_n^Y$ measurable, imply

$$E[Y_{n+1}|\mathcal{F}_n^Y] = E[E[Y_{n+1}|\mathcal{F}_n]|\mathcal{F}_n^Y] = E[Y_n|\mathcal{F}_n^Y] = Y_n.$$

Thus, in practice, if $Y$ is said to be a martingale and no filtration $\boldsymbol{\mathcal{F}}$ is specified, at least $Y$ is a
martingale with respect to the filtration it generates.

Note that if $Y$ is a martingale with respect to a filtration $\boldsymbol{\mathcal{F}}$, then for any $n, k \ge 0$,

$$E[Y_{n+k+1}|\mathcal{F}_n] = E[E[Y_{n+k+1}|\mathcal{F}_{n+k}]|\mathcal{F}_n] = E[Y_{n+k}|\mathcal{F}_n].$$

Therefore, by induction on $k$ for $n$ fixed:

$$E[Y_{n+k}|\mathcal{F}_n] = Y_n, \qquad (10.5)$$

for $n, k \ge 0$.

Example 10.2.2 Suppose $(U_i : i \ge 1)$ is a collection of independent random variables, each with
mean zero. Let $S_0 = 0$ and for $n \ge 1$, $S_n = \sum_{i=1}^{n} U_i$. Let $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$ denote the filtration
generated by $S$: $\mathcal{F}_n = \sigma(S_0, \ldots, S_n)$. Equivalently, $\boldsymbol{\mathcal{F}}$ is the filtration generated by $(U_i : i \ge 1)$:
$\mathcal{F}_0 = \{\emptyset, \Omega\}$ and $\mathcal{F}_n = \sigma(U_1, \ldots, U_n)$ for $n \ge 1$. Then $S = (S_n : n \ge 0)$ is a martingale with respect
to $\boldsymbol{\mathcal{F}}$:

$$E[S_{n+1}|\mathcal{F}_n] = E[U_{n+1}|\mathcal{F}_n] + E[S_n|\mathcal{F}_n] = 0 + S_n = S_n.$$



Example 10.2.3 Suppose $S = (S_n : n \ge 0)$ and $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$ are defined as in Example
10.2.2 in terms of a sequence of independent random variables $U = (U_i : i \ge 1)$. Suppose in
addition that $\mathrm{Var}(U_i) = \sigma^2$ for some finite constant $\sigma^2$. Finally, let $M_n = S_n^2 - n\sigma^2$ for $n \ge 0$.
Then $M = (M_n : n \ge 0)$ is a martingale relative to $\boldsymbol{\mathcal{F}}$. Indeed, $M$ is adapted to $\boldsymbol{\mathcal{F}}$. Since
$S_{n+1} = S_n + U_{n+1}$, we have $M_{n+1} = M_n + 2 S_n U_{n+1} + U_{n+1}^2 - \sigma^2$, so that

$$E[M_{n+1} | \mathcal{F}_n] = E[M_n | \mathcal{F}_n] + 2 S_n E[U_{n+1} | \mathcal{F}_n] + E[U_{n+1}^2 - \sigma^2 | \mathcal{F}_n]$$
$$= M_n + 2 S_n E[U_{n+1}] + E[U_{n+1}^2 - \sigma^2]$$
$$= M_n.$$
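
The martingale property of $M_n = S_n^2 - n\sigma^2$ can be checked exactly on a small case. The following is a minimal sketch, not part of the original notes: the step distribution `STEP` (values $-1$ and $2$ with probabilities $2/3$ and $1/3$) and all function names are assumptions chosen for illustration. It enumerates every prefix $(U_1, \ldots, U_n)$ and confirms with exact rational arithmetic that the conditional mean of $M_{n+1}$ equals $M_n$.

```python
from fractions import Fraction
from itertools import product

# Hypothetical zero-mean step distribution (value -> probability); an assumption
# made for illustration, not taken from the notes.
STEP = {-1: Fraction(2, 3), 2: Fraction(1, 3)}
MEAN = sum(u * p for u, p in STEP.items())
VAR = sum(u * u * p for u, p in STEP.items()) - MEAN ** 2


def m_value(prefix):
    """M_n = S_n^2 - n * sigma^2 for the observed steps in `prefix`."""
    s = sum(prefix)
    return s * s - len(prefix) * VAR


def conditional_mean_next(prefix):
    """E[M_{n+1} | U_1, ..., U_n = prefix], averaging over the next step."""
    return sum(p * m_value(prefix + (u,)) for u, p in STEP.items())


def is_martingale(max_n):
    """Check E[M_{n+1} | F_n] = M_n on every prefix of length 0..max_n."""
    for n in range(max_n + 1):
        for prefix in product(STEP, repeat=n):
            if conditional_mean_next(prefix) != m_value(prefix):
                return False
    return True
```

Calling `is_martingale(5)` walks all prefixes of length at most five. Dropping the correction term $n\sigma^2$ from `m_value` makes the check fail, which is the submartingale behavior of $S_n^2$.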



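Example 10.2.4 $M_n = e^{\theta S_n} / (E[e^{\theta X_1}])^n$ is a martingale in case $S_n = X_1 + \cdots + X_n$, the $X$'s are
iid, and $E[e^{\theta X_1}]$ is finite.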



Example 10.2.5 (Branching process) Let $G_n$ denote the number of individuals in the $n$th
generation. Suppose the number of offspring per individual is represented by a random variable $X$.
Select $a > 0$ so that $E[a^X] = 1$. Then $a^{G_n}$ is a martingale.

Example 10.2.6 (Cumulative innovations process) Let
$M_n = Y_n - \sum_{k=0}^{n-1} E[Y_{k+1} - Y_k | Y_0, \ldots, Y_k]$.

Example 10.2.7 (Doob martingale) $Y_n = E[\Phi | \mathcal{F}_n]$. For example, consider a graph with $n$ nodes and
$m$ edges, let $X_i = I_{\{i\text{th edge exists}\}}$, and let the filtration be generated by $X_0, X_1, \ldots$.



Definition 10.2.8 A martingale difference sequence $(D_n : n \ge 1)$ relative to a filtration
$\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$ is a sequence of random variables $(D_n : n \ge 1)$ such that

(0) $(D_n : n \ge 1)$ is adapted to $\boldsymbol{\mathcal{F}}$ (i.e. $D_n$ is $\mathcal{F}_n$-measurable for each $n \ge 1$),

(i) $E[|D_n|] < \infty$ for $n \ge 1$,

(ii) $E[D_{n+1} | \mathcal{F}_n] = 0$ a.s. for all $n \ge 0$.

Equivalently, $(D_n : n \ge 1)$ has the form $D_n = M_n - M_{n-1}$ for $n \ge 1$, for some $(M_n : n \ge 0)$ which
is a martingale with respect to $\boldsymbol{\mathcal{F}}$.

Definition 10.2.9 A random process $(H_n : n \ge 1)$ is said to be predictable with respect to a
filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$ if $H_n$ is $\mathcal{F}_{n-1}$ measurable for all $n \ge 1$. (Sometimes this is called
"one-step" predictable, because $\mathcal{F}_n$ determines $H$ one step ahead.)

Example 10.2.10 Suppose $(D_n : n \ge 1)$ is a martingale difference sequence and $(H_k : k \ge 1)$
is predictable, both relative to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$. We claim that the new process
$\tilde{D} = (\tilde{D}_n : n \ge 1)$ defined by $\tilde{D}_n = H_n D_n$ is also a martingale difference sequence with respect to
$\boldsymbol{\mathcal{F}}$. Indeed, it is adapted, has finite means, and

$$E[H_{n+1} D_{n+1} | \mathcal{F}_n] = H_{n+1} E[D_{n+1} | \mathcal{F}_n] = 0,$$

where we pulled out the $\mathcal{F}_n$ measurable random variable $H_{n+1}$ from the conditional expectation
given $\mathcal{F}_n$. An interpretation is that $D_n$ is the net gain to a gambler if one dollar is staked on the
outcome of a fair game in round $n$, and so $H_n D_n$ is the net gain if $H_n$ dollars are staked on round
$n$. The requirement that $(H_k : k \ge 1)$ be predictable means that the gambler must decide how
much to stake in round $n$ based only on information available at the end of round $n - 1$. It would
be an unfair advantage if the gambler already knew $D_n$ when deciding how much money to stake
in round $n$.

If the initial reserves of the gambler were some constant $M_0$, then the reserves of the gambler
after $n$ rounds would be given by:

$$M_n = M_0 + \sum_{k=1}^n H_k D_k.$$

Then $(M_n : n \ge 0)$ is a martingale with respect to $\boldsymbol{\mathcal{F}}$. The random variables $H_k D_k$, $1 \le k \le n$,
are orthogonal. Also, $E[(H_k D_k)^2] = E[E[(H_k D_k)^2 | \mathcal{F}_{k-1}]] = E[H_k^2 E[D_k^2 | \mathcal{F}_{k-1}]]$. Therefore,

$$E[(M_n - M_0)^2] = \sum_{k=1}^n E[H_k^2 E[D_k^2 | \mathcal{F}_{k-1}]].$$
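
The mean and second-moment identities for the gambler's reserve can be confirmed by exhaustive enumeration. The sketch below is illustrative only and rests on assumptions not in the notes: the game pays $D_k = \pm 1$ with equal probability (so $E[D_k^2 | \mathcal{F}_{k-1}] = 1$), and the staking rule `stake` (one dollar plus one dollar per past loss) is a hypothetical predictable strategy.

```python
from fractions import Fraction
from itertools import product


def stake(past):
    """Hypothetical predictable rule: H_k depends only on D_1, ..., D_{k-1}."""
    return 1 + sum(1 for d in past if d == -1)


def moments(n, m0=0):
    """Return (E[M_n], E[(M_n - M_0)^2], sum_k E[H_k^2]) by enumerating all paths."""
    prob = Fraction(1, 2 ** n)              # each +/-1 path is equally likely
    mean = second = predicted = Fraction(0)
    for path in product((1, -1), repeat=n):
        gain = 0
        for k in range(n):
            h = stake(path[:k])             # stake chosen before round k+1 is played
            gain += h * path[k]
            predicted += prob * h * h       # E[H_k^2 E[D_k^2 | F_{k-1}]] with D_k^2 = 1
        mean += prob * (m0 + gain)
        second += prob * gain * gain
    return mean, second, predicted
```

For instance, `moments(2)` returns a mean of 0 and the value 7/2 for both the second moment and the predicted sum. Any other rule that looks only at `past` should give the same agreement, whereas a rule that peeks at the current outcome would not.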



10.3 Azuma-Hoeffding inequality

One of the simplest inequalities for martingales is the Azuma-Hoeffding inequality. It is proven
in this section, and applications to prove concentration inequalities for some combinatorial problems
are given.²

Lemma 10.3.1 Suppose $D$ is a random variable with $E[D] = 0$ and $P\{|D - b| \le d\} = 1$ for some
constant $b$. Then for any $\alpha \in \mathbb{R}$, $E[e^{\alpha D}] \le e^{(\alpha d)^2/2}$.

Proof. Since $D$ has mean zero and $D$ lies in the interval $[b - d, b + d]$ with probability one, the
interval must contain zero, so $|b| \le d$. To avoid trivial cases we assume that $|b| < d$. Since $e^{\alpha x}$ is
convex in $x$, the value of $e^{\alpha x}$ for $x \in [b - d, b + d]$ is bounded above by the linear function that is
equal to $e^{\alpha x}$ at the endpoints, $x = b \pm d$, of the interval:

$$e^{\alpha x} \le \frac{x - b + d}{2d} e^{\alpha(b+d)} + \frac{b + d - x}{2d} e^{\alpha(b-d)}.$$ (10.6)

Since $D$ lies in that interval with probability one, (10.6) remains true if $x$ is replaced by the random
variable $D$. Taking expectations on both sides and using $E[D] = 0$ yields

$$E[e^{\alpha D}] \le \frac{d - b}{2d} e^{\alpha(b+d)} + \frac{d + b}{2d} e^{\alpha(b-d)}.$$ (10.7)



The proof is completed by showing that the right side of (10.7) is less than or equal to $e^{(\alpha d)^2/2}$ for
any $|b| < d$. Letting $u = \alpha d$ and $\theta = b/d$, the inequality to be proved becomes $f(u) \le u^2/2$, for
$u \in \mathbb{R}$ and $|\theta| < 1$, where

$$f(u) = \ln\left( \frac{(1 - \theta) e^{u(1+\theta)} + (1 + \theta) e^{u(-1+\theta)}}{2} \right).$$

Taylor's formula implies that $f(u) = f(0) + f'(0) u + \frac{f''(v) u^2}{2}$ for some $v$ in the interval with endpoints
$0$ and $u$. Elementary, but somewhat tedious, calculations show that

$$f'(u) = \frac{(1 - \theta^2)(e^u - e^{-u})}{(1 - \theta) e^u + (1 + \theta) e^{-u}}$$

and

$$f''(u) = \frac{4(1 - \theta^2)}{[(1 - \theta) e^u + (1 + \theta) e^{-u}]^2} = \frac{1}{\cosh^2(u + \beta)},$$

where $\beta = \frac{1}{2} \ln\left(\frac{1 - \theta}{1 + \theta}\right)$. Note that $f(0) = f'(0) = 0$, and $f''(u) \le 1$ for all $u \in \mathbb{R}$. Therefore,
$f(u) \le u^2/2$ for all $u \in \mathbb{R}$, as was to be shown. ■

² See McDiarmid's survey paper.
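
As a numerical sanity check of Lemma 10.3.1, the right side of (10.7) is exactly the moment generating function of the mean-zero distribution concentrated on the two endpoints $b - d$ and $b + d$, so it suffices to compare that quantity with $e^{(\alpha d)^2/2}$. The sketch below does this on a grid. The grid ranges and function names are arbitrary choices made here for illustration, and floating point is used, so the comparison holds only up to rounding.

```python
import math


def two_point_mgf(alpha, b, d):
    """Right side of (10.7): E[exp(alpha*D)] for the mean-zero law on {b-d, b+d}."""
    return ((d - b) * math.exp(alpha * (b + d)) +
            (d + b) * math.exp(alpha * (b - d))) / (2 * d)


def lemma_bound(alpha, d):
    """The bound exp((alpha*d)^2 / 2) claimed by Lemma 10.3.1."""
    return math.exp((alpha * d) ** 2 / 2)


def worst_ratio(d=1.0, steps=41):
    """Largest value of two_point_mgf / lemma_bound over a grid with |b| < d."""
    worst = 0.0
    for i in range(steps):
        b = -0.95 * d + 1.9 * d * i / (steps - 1)
        for j in range(steps):
            alpha = -4.0 + 8.0 * j / (steps - 1)
            ratio = two_point_mgf(alpha, b, d) / lemma_bound(alpha, d)
            worst = max(worst, ratio)
    return worst
```

The ratio reaches one at $\alpha = 0$, where both sides equal one, and should not exceed one elsewhere. With $b = 0$ the comparison reduces to the familiar inequality $\cosh(\alpha d) \le e^{(\alpha d)^2/2}$.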

Definition 10.3.2 A random process $(B_n : n \ge 0)$ is said to be predictable with respect to a
filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$ if $B_0$ is a constant and $B_n$ is $\mathcal{F}_{n-1}$ measurable for all $n \ge 1$.

Proposition 10.3.3 (Azuma-Hoeffding inequality with centering) Let $(Y_n : n \ge 0)$ be a martingale
and $(B_n : n \ge 0)$ be a predictable process, both with respect to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$, such
that $P\{|Y_{n+1} - B_{n+1}| \le d_{n+1}\} = 1$ for all $n \ge 0$. Then

$$P\{|Y_n - Y_0| \ge \lambda\} \le 2 \exp\left( -\frac{\lambda^2}{2 \sum_{i=1}^n d_i^2} \right).$$

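Proof. Let $n \ge 1$. The idea is to write $Y_n = Y_n - Y_{n-1} + Y_{n-1}$, to use the tower property of
conditional expectation, and to apply Lemma 10.3.1 to the random variable $Y_n - Y_{n-1}$ for $d = d_n$
(conditionally on $\mathcal{F}_{n-1}$, with $b = B_n - Y_{n-1}$). This yields:

$$E[e^{\alpha(Y_n - Y_0)}] = E[E[e^{\alpha(Y_n - Y_{n-1} + Y_{n-1} - Y_0)} | \mathcal{F}_{n-1}]]$$
$$= E[e^{\alpha(Y_{n-1} - Y_0)} E[e^{\alpha(Y_n - Y_{n-1})} | \mathcal{F}_{n-1}]]$$
$$\le E[e^{\alpha(Y_{n-1} - Y_0)}] e^{(\alpha d_n)^2/2}.$$

Thus, by induction on $n$,

$$E[e^{\alpha(Y_n - Y_0)}] \le e^{(\alpha^2/2) \sum_{i=1}^n d_i^2}.$$

The remainder of the proof is essentially the Chernoff inequality: for $\alpha \ge 0$,

$$P\{Y_n - Y_0 \ge \lambda\} \le E[e^{\alpha(Y_n - Y_0 - \lambda)}] \le e^{(\alpha^2/2) \sum_{i=1}^n d_i^2 - \alpha\lambda}.$$

Finally, taking $\alpha$ to make this bound as tight as possible, i.e. $\alpha = \frac{\lambda}{\sum_{i=1}^n d_i^2}$, yields

$$P\{Y_n - Y_0 \ge \lambda\} \le \exp\left( -\frac{\lambda^2}{2 \sum_{i=1}^n d_i^2} \right).$$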



Similarly, $P\{Y_n - Y_0 \le -\lambda\}$ satisfies the same bound because the previous bound applies for $(Y_n)$
replaced by $(-Y_n)$, yielding the proposition. ■

Definition 10.3.4 A function $f$ of $n$ variables $x_1, \ldots, x_n$ is said to satisfy the Lipschitz condition
with constant $c$ if $|f(x_1, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, y_i, x_{i+1}, \ldots, x_n)| \le c$ for any $x_1, \ldots, x_n$, $i$, and
$y_i$.³

Proposition 10.3.5 (McDiarmid's inequality) Suppose $F = f(X_1, \ldots, X_n)$, where $f$ satisfies the
Lipschitz condition with constant $c$, and $X_1, \ldots, X_n$ are independent random variables. Then

$$P\{|F - E[F]| \ge \lambda\} \le 2 \exp\left( -\frac{2\lambda^2}{n c^2} \right).$$

Proof. Let $(Z_k : 0 \le k \le n)$ denote the Doob martingale defined by $Z_k = E[F | \mathcal{F}_k]$, where,
as usual, $\mathcal{F}_k = \sigma(X_j : 1 \le j \le k)$ is the filtration generated by $(X_k)$. Note that $\mathcal{F}_0$ is the trivial
σ-algebra $\{\emptyset, \Omega\}$, corresponding to no observations, so $Z_0 = E[F]$. Also, $Z_n = F$. In words, $Z_k$ is
the expected value of $F$, given that the first $k$ $X$'s are revealed.

For $0 \le k \le n - 1$, let

$$g_k(x_1, \ldots, x_k, x_{k+1}) = E[f(x_1, \ldots, x_{k+1}, X_{k+2}, \ldots, X_n)].$$

Note that $Z_{k+1} = g_k(X_1, \ldots, X_{k+1})$, by the independence of the $X$'s. Since $f$ satisfies the Lipschitz
condition with constant $c$, the same is true of $g_k$. In particular, for $x_1, \ldots, x_k$ fixed, the set of
possible values (i.e. range) of $g_k(x_1, \ldots, x_{k+1})$ as $x_{k+1}$ varies, lies within some interval (depending
on $x_1, \ldots, x_k$) with length at most $c$. We define $m_k(x_1, \ldots, x_k)$ to be the midpoint of the smallest
such interval:

$$m_k(x_1, \ldots, x_k) = \frac{\sup_{x_{k+1}} g_k(x_1, \ldots, x_{k+1}) + \inf_{x_{k+1}} g_k(x_1, \ldots, x_{k+1})}{2}$$

and let $B_{k+1} = m_k(X_1, \ldots, X_k)$. Then $B$ is a predictable process and $|Z_{k+1} - B_{k+1}| \le \frac{c}{2}$ with
probability one. Thus, the Azuma-Hoeffding inequality with centering can be applied with $d_i = \frac{c}{2}$
for all $i$, giving the desired result. ■
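
To see how loose or tight these bounds are in a concrete case, they can be compared with an exact tail probability. The sketch below makes several assumptions for illustration: $Y_n$ is the sum of $n$ independent fair $\pm 1$ steps (so $Y_0 = 0$), the Azuma-Hoeffding bound is used with $B_n = Y_{n-1}$ and $d_i = 1$, and the McDiarmid-type bound is the one obtained by taking $d_i = c/2$ in the Azuma-Hoeffding inequality, here with $c = 2$ because changing one step changes the sum by 2. Function names are hypothetical.

```python
import math
from fractions import Fraction


def exact_two_sided_tail(n, lam):
    """P{|Y_n| >= lam} for Y_n = sum of n independent fair +/-1 steps."""
    # Y_n = 2K - n where K ~ Binomial(n, 1/2) counts the +1 steps.
    count = sum(math.comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= lam)
    return Fraction(count, 2 ** n)


def azuma_bound(n, lam, d=1.0):
    """2 exp(-lam^2 / (2 n d^2)): Azuma-Hoeffding with d_i = d for all i."""
    return 2 * math.exp(-lam ** 2 / (2 * n * d ** 2))


def mcdiarmid_bound(n, lam, c=2.0):
    """2 exp(-2 lam^2 / (n c^2)): Azuma-Hoeffding with d_i = c/2 for all i."""
    return 2 * math.exp(-2 * lam ** 2 / (n * c ** 2))
```

For this walk the two bounds coincide, and comparing `float(exact_two_sided_tail(n, lam))` with `azuma_bound(n, lam)` for a few values of `n` and `lam` shows how conservative the exponential bound is for small deviations.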



Example 10.3.6 Let $V = \{v_1, \ldots, v_n\}$ be a finite set of cardinality $n \ge 1$. For each $i, j$ with
$1 \le i < j \le n$, suppose that $Z_{i,j}$ is a Bernoulli random variable with parameter $p$, where $0 \le p \le 1$.
Suppose that the $Z$'s are mutually independent. Let $G = (V, E)$ be a random graph, such that for
$i < j$, there is an undirected edge between vertices $v_i$ and $v_j$ (i.e. $v_i$ and $v_j$ are neighbors) if and only
if $Z_{i,j} = 1$. Equivalently, the set of edges is $E = \{\{i, j\} : i < j \text{ and } Z_{i,j} = 1\}$. An independent set in
the graph is a set of vertices, no two of which are neighbors. Let $\mathcal{I} = \mathcal{I}(G)$ denote the maximum of
the cardinalities of all independent sets for $G$. Note that $\mathcal{I}$ is a random variable, because the graph
is random. We shall apply McDiarmid's inequality to find a concentration bound for $\mathcal{I}(G)$. Note
that $\mathcal{I}(G) = F((Z_{i,j} : 1 \le i < j \le n))$, for an appropriate function $F$. We could write a computer
program for computing $F$, for example by cycling through all subsets of $V$, seeing which ones are
independent sets, and reporting the largest cardinality of the independent sets. However, there is no
need to be so explicit about what $F$ is. Observe next that changing any one of the $Z$'s would change
$\mathcal{I}(G)$ by at most one. In particular, if there is an independent set in a graph, and if one edge is
added to the graph, then at most one vertex would have to be removed from the independent set for
the original graph to obtain an independent set for the new graph. Thus, $F$ satisfies the Lipschitz
condition with constant $c = 1$. Thus, by McDiarmid's inequality with $c = 1$ and $m = n(n - 1)/2$
variables,

$$P\{|\mathcal{I} - E[\mathcal{I}]| \ge \lambda\} \le 2 \exp\left( -\frac{4\lambda^2}{n(n-1)} \right).$$

³ Equivalently, $|f(x) - f(y)| \le c \, d_H(x, y)$, where $d_H(x, y)$ denotes the Hamming distance, which is the number of
coordinates in which $x$ and $y$ differ. In the analysis of functions of a continuous variable, the Euclidean distance is
used instead of the Hamming distance.

More thought yields a tighter bound. For $1 \le i \le n$, let $X_i = (Z_{i,i+1}, Z_{i,i+2}, \ldots, Z_{i,n})$. In words, for
each $i$, $X_i$ determines which vertices with index larger than $i$ are neighbors of vertex $v_i$. Of course
$\mathcal{I}$ is also determined by $X_1, \ldots, X_n$. Moreover, if any one of the $X$'s changes, $\mathcal{I}$ changes by at most
one, because changing $X_i$ only affects edges incident to $v_i$. That is, $\mathcal{I}$ can be expressed as a function
of the $n$ variables $X_1, \ldots, X_n$, such that the function satisfies the Lipschitz condition with constant
$c = 1$. Therefore, by McDiarmid's inequality with $c = 1$ and $n$ variables,⁴

$$P\{|\mathcal{I} - E[\mathcal{I}]| \ge \lambda\} \le 2 \exp\left( -\frac{2\lambda^2}{n} \right).$$

For example, if $\lambda = a\sqrt{n}$, we have

$$P\{|\mathcal{I} - E[\mathcal{I}]| \ge a\sqrt{n}\} \le 2 \exp(-2a^2)$$

whenever $n \ge 1$, $0 \le p \le 1$, and $a > 0$.
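
The brute-force program mentioned in this example, and the Lipschitz property with respect to the vertex variables, can be sketched for very small graphs. The code below is illustrative only: vertices are labeled $0, \ldots, n-1$ rather than $v_1, \ldots, v_n$, edges are stored as pairs $(i, j)$ with $i < j$, and the function names are assumptions. It cycles through all subsets to compute the independence number, then redraws the edges from one vertex to its higher-indexed neighbors in every possible way and records the largest resulting change.

```python
from itertools import combinations, product


def independence_number(n, edges):
    """Largest size of a subset of range(n) containing no pair from `edges`."""
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            if all(pair not in edges for pair in combinations(subset, 2)):
                return size
    return 0


def max_change(n):
    """Largest change in the independence number when a single X_i is redrawn."""
    pairs = list(combinations(range(n), 2))
    worst = 0
    for bits in product((0, 1), repeat=len(pairs)):      # every graph on n vertices
        edges = {p for p, z in zip(pairs, bits) if z}
        base = independence_number(n, edges)
        for i in range(n - 1):                            # last vertex has empty X_i
            higher = [(i, j) for j in range(i + 1, n)]
            rest = edges - set(higher)
            for new_bits in product((0, 1), repeat=len(higher)):
                new_edges = rest | {p for p, z in zip(higher, new_bits) if z}
                change = abs(independence_number(n, new_edges) - base)
                worst = max(worst, change)
    return worst
```

The search is exponential in $n$, so it is only practical for a handful of vertices. Its purpose is to confirm the constant $c = 1$ on small cases, not to compute $\mathcal{I}(G)$ for graphs of realistic size.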

McDiarmid's inequality can similarly be applied to obtain concentration inequalities for many
other numbers associated with graphs, such as the size of a maximum matching (a matching is a
set of edges, no two of which have a node in common), chromatic index (number of colors needed
to color all edges so that all edges containing a single vertex are different colors), chromatic number
(number of colors needed to color all vertices so that neighbors are different colors), minimum
number of edges that need to be cut to break the graph into two equal size components, and so on.

10.4 Stopping times and the optional sampling theorem 

Let $X = (X_k : k \ge 0)$ be a martingale with respect to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_k : k \ge 0)$. Note that
$E[X_{k+1}] = E[E[X_{k+1} | \mathcal{F}_k]] = E[X_k]$. So, by induction on $n$, $E[X_n] = E[X_0]$ for all $n \ge 0$.

A useful interpretation of a martingale $X = (X_k : k \ge 0)$ is that $X_k$ is the reserve (amount of
money on hand) that a gambler playing a fair game at each time step has after $k$ time steps, if $X_0$
is the initial reserve. (If the gambler is allowed to go into debt, the reserve can be negative.) The
condition $E[X_{k+1} | \mathcal{F}_k] = X_k$ means that, given the knowledge that is observable up to time $k$, the
expected reserve after the next game is equal to the reserve at time $k$. The equality $E[X_n] = E[X_0]$
has the natural interpretation that the expected reserve of the gambler after $n$ games have been
played is equal to the initial reserve $X_0$.

What happens if the gambler stops after a random number, $T$, of games? Is it true that
$E[X_T] = E[X_0]$?

⁴ Since $X_n$ is degenerate, we could use $n - 1$ instead of $n$, but it makes little difference.

Example 10.4.1 Suppose that $X_n = W_1 + \cdots + W_n$ where $P\{W_k = 1\} = P\{W_k = -1\} = 0.5$ for
all $k$, and the $W$'s are independent. Let $T$ be the random time:

$$T = \begin{cases} 3 & \text{if } W_1 + W_2 + W_3 = 3 \\ 0 & \text{else} \end{cases}$$

Then $X_T = 3$ with probability 1/8, and $X_T = 0$ otherwise. Hence, $E[X_T] = 3/8$.
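
The value $3/8$ can be reproduced by enumerating the eight equally likely outcomes of the first three games. The sketch below reads the rule for $T$ as "$T = 3$ exactly when the first three outcomes are all $+1$", which is the reading under which $X_T = 3$ with probability 1/8. That reading and the function names are assumptions made for this illustration.

```python
from fractions import Fraction
from itertools import product


def expected_stopped_value():
    """E[X_T] for the random time T of Example 10.4.1, with X_0 = 0."""
    total = Fraction(0)
    for w in product((1, -1), repeat=3):
        t = 3 if sum(w) == 3 else 0      # T is decided by looking at W_1, W_2, W_3
        x_t = sum(w[:t])                 # X_T = W_1 + ... + W_T
        total += Fraction(1, 8) * x_t
    return total


def prob_stop_at_zero():
    """P{T = 0}; a value strictly between 0 and 1 cannot be known at time zero."""
    outcomes = list(product((1, -1), repeat=3))
    return Fraction(sum(1 for w in outcomes if sum(w) != 3), len(outcomes))
```

Because `prob_stop_at_zero()` is neither 0 nor 1, the event $\{T = 0\}$ is not determined before any games are played. The rule therefore requires knowledge of future outcomes.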

Does Example 10.4.1 give a realistic strategy for a gambler to obtain a strictly positive expected
payoff from a fair game? To implement the strategy, the gambler should stop gambling after $T$
games. However, the event $\{T = 0\}$ depends on the outcomes $W_1$, $W_2$, and $W_3$. Thus, at time
zero, the gambler is required to make a decision about whether to stop before any games are played
based on the outcomes of the first three games. Unless the gambler can somehow predict the future,
the gambler will be unable to implement the strategy of stopping play after $T$ games.

Intuitively, a random time corresponds to an implementable stopping strategy if the gambler
has enough information after $n$ games to tell whether to play future games. That type of condition
is captured by the notion of optional stopping time, defined as follows.

Definition 10.4.2 An optional stopping time $T$ relative to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_k : k \ge 0)$ is a
random variable with values in $\mathbb{Z}_+$ such that for any $n \ge 0$, $\{T \le n\} \in \mathcal{F}_n$.

The intuitive interpretation of the condition $\{T \le n\} \in \mathcal{F}_n$ is that the gambler should have enough
information by time $n$ to know whether to stop by time $n$. Since σ-algebras are closed under set
complements, the condition in the definition of an optional stopping time is equivalent to requiring
that, for any $n \ge 0$, $\{T > n\} \in \mathcal{F}_n$. This means that the gambler should have enough information
by time $n$ to know whether to continue gambling strictly beyond time $n$.

Example 10.4.3 Let $(X_n : n \ge 0)$ be a random process adapted to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$.
Let $A$ be some fixed (Borel measurable) set, and let $T = \min\{n \ge 0 : X_n \in A\}$. Then $T$ is a
stopping time relative to $\boldsymbol{\mathcal{F}}$. Indeed, $\{T \le n\} = \{X_k \in A \text{ for some } k \text{ with } 0 \le k \le n\}$. So $\{T \le n\}$
is an event determined by $(X_0, \ldots, X_n)$, which is in $\mathcal{F}_n$ because $X$ is adapted to the filtration.



Example 10.4.4 Suppose $W_1, W_2, \ldots$ are independent Bernoulli random variables with $p = 0.5$,
modeling fair coin flips. Suppose that if a gambler stakes some money at the beginning of the
$n$th round, then if $W_n = 1$, the gambler wins back the stake and an additional equal amount. If
$W_n = 0$, the gambler loses the money staked. Let $X_n$ denote the reserve of the gambler after $n$
rounds. For simplicity, we assume that the gambler can borrow money as needed, and that the
initial reserve of the gambler is zero. So $X_0 = 0$. Suppose the gambler adopts the following strategy.
The gambler continues playing until the first win, and in each round until stopping, the gambler
stakes the amount of money needed to ensure that the reserve at the time the gambler stops is one
dollar. For example, the gambler initially borrows one dollar, and stakes it on the first outcome.
If $W_1 = 1$ the gambler's reserve (money in hand minus the amount borrowed) is one dollar, and
the gambler stops, so $T = 1$ and $X_T = 1$. Since no money is staked after time $T$, $X_k = X_T$ for
all $k \ge T$. If the gambler loses in the first round (i.e. $W_1 = 0$), then $X_1 = -1$. In that case,
the gambler keeps playing, and, next, borrows two more dollars and stakes them on the second
outcome. If $W_2 = 1$ the gambler's reserve is one dollar, and the gambler stops. So $T = 2$ and
$X_T = 1$. If the gambler loses in the second round (i.e. $W_2 = 0$), then $X_2 = -3$. In that case, the
gambler keeps playing, and, next, borrows four more dollars and stakes them on the third outcome,
and so on. The random process $(X_n : n \ge 0)$ is a martingale. For this strategy, the number of
rounds, $T$, that the gambler plays has the geometric distribution with parameter $p = 0.5$. Thus,
$E[T] = 2$. In particular, $T$ is finite with probability one. Thus, $X_T = 1$ a.s., while $X_0 = 0$. Thus,
$E[X_T] \ne E[X_0]$. This strategy does not require the gambler to be able to predict the future, and
the gambler is always up one dollar after stopping.

But don't run out and start playing this strategy, expecting to make money for sure. There is
a catch: the amount borrowed can be very large. Indeed, let us compute the expectation of $B$, the
total amount borrowed before the final win. If $T = 1$ then $B = 1$ (only the dollar borrowed in the
first round is counted). If $T = 2$ then $B = 3$ (the first dollar in the first round, and two more in
the second). In general, $B = 2^T - 1$. Thus,

$$E[B] = \sum_{n=1}^\infty (2^n - 1) P\{T = n\} = \sum_{n=1}^\infty (2^n - 1) 2^{-n} = \sum_{n=1}^\infty (1 - 2^{-n}) = +\infty.$$

That is, the expected amount of money the gambler will need to borrow is infinite.
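
Both features of the doubling strategy can be made concrete with exact arithmetic. The sketch below is illustrative, with hypothetical function names. It computes the mean reserve when play is cut off after at most $n$ rounds, using the fact that after $n$ straight losses the reserve is $-(2^n - 1)$, and it computes the partial sums of the series for $E[B]$.

```python
from fractions import Fraction


def stopped_mean(n):
    """E[X_{min(T, n)}] for the doubling strategy, with X_0 = 0."""
    # With probability 2^-k the first win is in round k <= n and the reserve is +1;
    # with probability 2^-n all n rounds are lost and the reserve is -(2^n - 1).
    win_prob = sum(Fraction(1, 2 ** k) for k in range(1, n + 1))
    return win_prob * 1 + Fraction(1, 2 ** n) * (-(2 ** n - 1))


def borrowed_partial_sum(n):
    """Sum over k <= n of (2^k - 1) P{T = k}, a lower bound on E[B]."""
    return sum(Fraction(2 ** k - 1, 2 ** k) for k in range(1, n + 1))
```

The truncated mean is zero for every $n$, even though the reserve at the actual stopping time is one dollar. The partial sums equal $n - 1 + 2^{-n}$ and so grow without bound, in line with $E[B] = +\infty$.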



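Proposition 10.4.5 If $X$ is a martingale and $T$ is an optional stopping time, relative to $(\Omega, \boldsymbol{\mathcal{F}}, P)$,
then $E[X_{T \wedge n}] = E[X_0]$ for any $n$.

Proof. Note that

$$X_{T \wedge (n+1)} - X_{T \wedge n} = \begin{cases} 0 & \text{if } T \le n \\ X_{n+1} - X_n & \text{if } T > n \end{cases}$$
$$= (X_{n+1} - X_n) I_{\{T > n\}}.$$

Using this, the tower property of conditional expectation, and the fact that $I_{\{T > n\}}$ is $\mathcal{F}_n$ measurable
yields

$$E[X_{T \wedge (n+1)} - X_{T \wedge n}] = E[E[(X_{n+1} - X_n) I_{\{T > n\}} | \mathcal{F}_n]]$$
$$= E[E[(X_{n+1} - X_n) | \mathcal{F}_n] I_{\{T > n\}}] = 0$$

because $E[(X_{n+1} - X_n) | \mathcal{F}_n] = 0$. Therefore, $E[X_{T \wedge (n+1)}] = E[X_{T \wedge n}]$ for all $n \ge 0$. So
by induction on $n$, $E[X_{T \wedge n}] = E[X_0]$ for all $n \ge 0$. ■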
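
The conclusion of Proposition 10.4.5 can be checked exactly for a simple random walk stopped at a hitting time. In the sketch below, the walk has independent fair $\pm 1$ steps with $X_0 = 0$, and $T$ is the first time the walk reaches a fixed level. The level 2 and the function name are arbitrary choices for illustration. The code enumerates all $2^n$ step sequences and averages $X_{T \wedge n}$.

```python
from fractions import Fraction
from itertools import product


def stopped_walk_mean(n, level=2):
    """E[X_{min(T, n)}] where T = min{k : X_k = level} for a fair +/-1 walk."""
    total = Fraction(0)
    for steps in product((1, -1), repeat=n):
        x = 0
        for w in steps:
            if x == level:       # T has already occurred, so the walk is frozen
                break
            x += w
        total += Fraction(1, 2 ** n) * x
    return total
```

Each call returns exactly zero, matching $E[X_{T \wedge n}] = E[X_0] = 0$. The hitting time qualifies as a stopping time because deciding whether the level has been reached by time $k$ uses only the first $k$ steps.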

The following corollary follows immediately from Proposition 10.4.5.

Corollary 10.4.6 If $X$ is a martingale and $T$ is an optional stopping time, relative to $(\Omega, \boldsymbol{\mathcal{F}}, P)$,
then $E[X_0] = \lim_{n \to \infty} E[X_{T \wedge n}]$. In particular, if

$$\lim_{n \to \infty} E[X_{T \wedge n}] = E[X_T]$$ (10.8)

then $E[X_T] = E[X_0]$.

By Corollary 10.4.6, the trick to establishing $E[X_T] = E[X_0]$ comes down to proving (10.8). Note
that $X_{T \wedge n} \to X_T$ a.s. as $n \to \infty$, so (10.8) is simply requiring the convergence of the means to the
mean of the limit, for an a.s. convergent sequence of random variables. There are several different
sufficient conditions for this to happen, involving conditions on the martingale $X$, the stopping
time $T$, or both. For example:

Corollary 10.4.7 If $X$ is a martingale and $T$ is an optional stopping time, relative to $(\Omega, \boldsymbol{\mathcal{F}}, P)$,
and if $T$ is bounded (so $P\{T \le n\} = 1$ for some $n$) then $E[X_T] = E[X_0]$.

Proof. If $P\{T \le n\} = 1$ then $T \wedge n = T$ with probability one, so $E[X_{T \wedge n}] = E[X_T]$. Therefore, the
corollary follows from Proposition 10.4.5. ■



Corollary 10.4.8 If $X$ is a martingale and $T$ is an optional stopping time, relative to $(\Omega, \boldsymbol{\mathcal{F}}, P)$,
and if there is a random variable $Z$ such that $|X_n| \le Z$ a.s. for all $n$, and $E[Z] < \infty$, then
$E[X_T] = E[X_0]$.

Proof. Let $\epsilon > 0$. Since $E[Z] < \infty$, there exists $\delta > 0$ so that if $A$ is any set with $P(A) \le \delta$, then
$E[Z I_A] \le \epsilon$. Since $X_{T \wedge n} \to X_T$ a.s., we also have $X_{T \wedge n} \to X_T$ in probability. Therefore, if $n$ is
sufficiently large, $P\{|X_{T \wedge n} - X_T| \ge \epsilon\} \le \delta$. For such $n$,

$$|X_{T \wedge n} - X_T| \le \epsilon + |X_{T \wedge n} - X_T| I_{\{|X_{T \wedge n} - X_T| > \epsilon\}}$$
$$\le \epsilon + 2|Z| I_{\{|X_{T \wedge n} - X_T| > \epsilon\}}.$$ (10.9)

Now $E[|Z| I_{\{|X_{T \wedge n} - X_T| > \epsilon\}}] \le \epsilon$ by the choice of $\delta$ and $n$. So taking expectations of each side of (10.9)
yields $E[|X_{T \wedge n} - X_T|] \le 3\epsilon$. Both $X_{T \wedge n}$ and $X_T$ have finite means, because both have absolute
values less than or equal to $Z$, so

$$|E[X_{T \wedge n}] - E[X_T]| = |E[X_{T \wedge n} - X_T]| \le E[|X_{T \wedge n} - X_T|] \le 3\epsilon.$$

Since $\epsilon$ was an arbitrary positive number, the corollary is proved. ■



Corollary 10.4.9 Suppose $(X_n : n \ge 0)$ is a martingale relative to $(\Omega, \boldsymbol{\mathcal{F}}, P)$. Suppose

(i) there is a constant $c$ such that $E[\, |X_{n+1} - X_n| \, | \mathcal{F}_n] \le c$ for $n \ge 0$,

(ii) $T$ is a stopping time such that $E[T] < \infty$.

Then $E[X_T] = E[X_0]$. If, instead, $(X_n : n \ge 0)$ is a submartingale relative to $(\Omega, \boldsymbol{\mathcal{F}}, P)$, satisfying
(i) and (ii), then $E[X_T] \ge E[X_0]$.

Proof. Suppose $(X_n : n \ge 0)$ is a martingale relative to $(\Omega, \boldsymbol{\mathcal{F}}, P)$, satisfying (i) and (ii). Looking
at the proof of Corollary 10.4.8, we see that it is enough to show that there is a random variable $Z$
such that $E[Z] < +\infty$ and $|X_{T \wedge n}| \le Z$ for all $n \ge 0$. Let

$$Z = |X_0| + |X_1 - X_0| + \cdots + |X_T - X_{T-1}|.$$

Obviously, $|X_{T \wedge n}| \le Z$ for all $n \ge 0$, so it remains to show that $E[Z] < \infty$. But, since the event
$\{i \le T\} = \{T > i - 1\}$ is in $\mathcal{F}_{i-1}$,

$$E[Z] = E[|X_0|] + E\left[ \sum_{i=1}^\infty |X_i - X_{i-1}| I_{\{i \le T\}} \right]$$
$$= E[|X_0|] + E\left[ \sum_{i=1}^\infty E[\, |X_i - X_{i-1}| I_{\{i \le T\}} \, | \mathcal{F}_{i-1}] \right]$$
$$= E[|X_0|] + E\left[ \sum_{i=1}^\infty I_{\{i \le T\}} E[\, |X_i - X_{i-1}| \, | \mathcal{F}_{i-1}] \right]$$
$$\le E[|X_0|] + c E\left[ \sum_{i=1}^\infty I_{\{i \le T\}} \right]$$
$$= E[|X_0|] + c E[T] < \infty.$$

The first statement of the Corollary is proved. If instead $X$ is a submartingale, then a minor
variation of Proposition 10.4.5 yields that $E[X_{T \wedge n}] \ge E[X_0]$. The proof for the first part of the
corollary, already given, shows that conditions (i) and (ii) imply that $E[X_{T \wedge n}] \to E[X_T]$ as $n \to \infty$.
Therefore, $E[X_T] \ge E[X_0]$. ■

Martingale inequalities offer a way to provide upper and lower bounds on the completion times 
of algorithms. The following example shows how a lower bound can be found for a particular game. 

Example 10.4.10 Consider the following game. There is an urn, initially with $k_1$ red marbles
and $k_2$ blue marbles. A player takes turns until the urn is empty, and the goal of the player is to
minimize the expected number of turns required. At the beginning of each turn, the player can
remove a set of marbles, and the set must be one of four types: one red, one blue, two reds, or
two blues. After removing the set of marbles, a fair coin is flipped. If
tails appears, the turn is over. If heads appears, then some marbles are added back to the urn,
according to Table 10.1. Our goal will be to find a lower bound on $E[T]$, where $T$ is the number
of turns needed by the player until the urn is empty. The bound should hold for any strategy the
player adopts. Let $X_n$ denote the total number of marbles in the urn after $n$ turns. If the player
elects to remove only one marble during a turn (either red or blue) then with probability one half,
two marbles are put back. Hence, for either set with one marble, the expected change in the total
number of marbles in the urn is zero. If the player elects to remove two reds or two blues, then
with probability one half, three marbles are put back into the urn. For these turns, the expected
change in the number of marbles in the urn is $-0.5$. Hence, for any choice of $u_n$ (representing the
decision of the player for the $(n+1)$th turn),

$$E[X_{n+1} | X_n, u_n] \ge X_n - 0.5 \quad \text{on } \{T > n\}.$$

Table 10.1: Rules of the marble game

Set removed     Set returned to bag on "heads"
one red         one red and one blue
one blue        one red and one blue
two reds        three blues
two blues       three reds


That is, the drift of $X_n$ towards zero is at most 0.5 in magnitude, so we suspect that no strategy
can empty the urn in average time less than $(k_1 + k_2)/0.5$. In fact, this result is true, and it is now
proved. Let $M_n = X_{n \wedge T} + \frac{n \wedge T}{2}$. By the observations above, $M$ is a submartingale. Furthermore,
$|M_{n+1} - M_n| \le 2$. Either $E[T] = +\infty$ or $E[T] < \infty$. If $E[T] = +\infty$ then the inequality to be
proved, $E[T] \ge 2(k_1 + k_2)$, is trivially true, so suppose $E[T] < \infty$. Then by Corollary 10.4.9,
$E[M_T] \ge E[M_0] = k_1 + k_2$. Also, $M_T = \frac{T}{2}$ with probability one, so $E[T] \ge 2(k_1 + k_2)$, as claimed.



10.5 Notes 

Material on the Azuma-Hoeffding inequality and McDiarmid's method can be found in McDiarmid's
tutorial article [7].

10.6 Problems 

10.1 Two martingales associated with a simple branching process 

Let $Y = (Y_n : n \ge 0)$ denote a simple branching process. Thus, $Y_n$ is the number of individuals in
the $n$th generation, $Y_0 = 1$, the numbers of offspring of different individuals are independent, and
each has the same distribution as a random variable $X$.

(a) Identify a constant $\theta$ so that $G_n = \frac{Y_n}{\theta^n}$ is a martingale.

(b) Let $\mathcal{E}$ denote the event of eventual extinction, and let $\alpha = P\{\mathcal{E}\}$. Show that
$P(\mathcal{E} | Y_0, \ldots, Y_n) = \alpha^{Y_n}$. Thus, $M_n = \alpha^{Y_n}$ is a martingale.

(c) Using the fact $E[M_1] = E[M_0]$, find an equation for $\alpha$. (Note: Problem 4.29 shows that $\alpha$ is
the smallest positive solution to the equation, and $\alpha < 1$ if and only if $E[X] > 1$.)

10.2 A covering problem 

Consider a linear array of $n$ cells. Suppose that $m$ base stations are randomly placed among the
cells, such that the locations of the base stations are independent, and uniformly distributed among
the $n$ cell locations. Let $r$ be a positive integer. Call a cell $i$ covered if there is at least one base
station at some cell $j$ with $|i - j| \le r - 1$. Thus, each base station (except those near the edge of
the array) covers $2r - 1$ cells. Note that there can be more than one base station at a given cell,
and interference between base stations is ignored.

(a) Let $F$ denote the number of cells covered. Apply the method of bounded differences based on
the Azuma-Hoeffding inequality to find an upper bound on $P\{|F - E[F]| \ge \gamma\}$.

(b) (This part is related to the coupon collector problem and may not have anything to do with
martingales.) Rather than fixing the number of base stations, $m$, let $X$ denote the number of base
stations needed until all cells are covered. In case $r = 1$ we have seen that
$P\{X \le n \ln n + cn\} \to \exp(-e^{-c})$ (the coupon collector's problem). For general $r \ge 1$, find $g_1(r)$ and
$g_2(r)$ so that for any $\epsilon > 0$, $P\{X \ge (g_2(r) + \epsilon) n \ln n\} \to 0$ and $P\{X \le (g_1(r) - \epsilon) n \ln n\} \to 0$.
(Ideally you can find $g_1 = g_2$, but if not, it'd be nice if they are close.)

10.3 Doob decomposition 

Suppose $X = (X_k : k \ge 0)$ is an integrable (meaning $E[|X_k|] < \infty$ for each $k$) sequence adapted to
a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_k : k \ge 0)$. (a) Show that there is a sequence $B = (B_k : k \ge 0)$ which is predictable
relative to $\boldsymbol{\mathcal{F}}$ (which means that $B_0$ is a constant and $B_k$ is $\mathcal{F}_{k-1}$ measurable for $k \ge 1$) and a mean
zero martingale $M = (M_k : k \ge 0)$, such that $X_k = B_k + M_k$ for all $k$. (b) Are the sequences $B$
and $M$ uniquely determined by $X$ and $\boldsymbol{\mathcal{F}}$?

10.4 On uniform integrability 

(a) Show that if $(X_i : i \in I)$ and $(Y_i : i \in I)$ are both uniformly integrable collections of random
variables with the same index set $I$, then $(Z_i : i \in I)$, where $Z_i = X_i + Y_i$ for all $i$, is also a
uniformly integrable collection. (b) Show that a collection of random variables $(X_i : i \in I)$ is
uniformly integrable if and only if there exists a convex increasing function $\varphi : \mathbb{R}_+ \to \mathbb{R}_+$ with
$\lim_{c \to \infty} \frac{\varphi(c)}{c} = +\infty$ and a constant $K$, such that $E[\varphi(|X_i|)] \le K$ for all $i \in I$.

10.5 Stopping time properties 

(a) Show that if $S$ and $T$ are stopping times for some filtration $\boldsymbol{\mathcal{F}}$, then $S \wedge T$, $S \vee T$, and $S + T$
are also stopping times.

(b) Show that if $\boldsymbol{\mathcal{F}}$ is a filtration and $X = (X_k : k \ge 0)$ is the random sequence defined by
$X_k = I_{\{T \le k\}}$ for some random time $T$ with values in $\mathbb{Z}_+$, then $T$ is a stopping time if and only if
$X$ is $\boldsymbol{\mathcal{F}}$-adapted.

(c) If $T$ is a stopping time for a filtration $\boldsymbol{\mathcal{F}}$, recall that $\mathcal{F}_T$ is the set of events $A$ such that
$A \cap \{T \le n\} \in \mathcal{F}_n$ for all $n$. (Or, for discrete time, the set of events $A$ such that $A \cap \{T = n\} \in \mathcal{F}_n$
for all $n$.) Show that (i) $\mathcal{F}_T$ is a σ-algebra, (ii) $T$ is $\mathcal{F}_T$ measurable, and (iii) if $X$ is an adapted
process then $X_T$ is $\mathcal{F}_T$ measurable.

10.6 A stopped random walk 

Let $W_1, W_2, \ldots$ be a sequence of independent, identically distributed mean zero random variables.
To avoid triviality, assume $P\{W_1 = 0\} \ne 1$. Let $S_0 = 0$ and $S_n = W_1 + \cdots + W_n$ for $n \ge 1$. Fix
a constant $c > 0$ and let $\tau = \min\{n \ge 0 : |S_n| \ge c\}$. The goal of this problem is to show that
$E[S_\tau] = 0$.

(a) Show that $E[S_\tau] = 0$ if there is a constant $D$ so that $P\{|W_i| > D\} = 0$. (Hint: Invoke a version
of the optional stopping theorem.)

(b) In view of part (a), we need to address the case that the $W$'s are not bounded. Let

$$\tilde{W}_n = \begin{cases} W_n & \text{if } |W_n| \le 2c \\ a & \text{if } W_n > 2c \\ -b & \text{if } W_n < -2c \end{cases}$$

where the constants $a$ and $b$ are selected so that $a \ge 2c$, $b \ge 2c$, and
$E[\tilde{W}_i] = 0$. Note that if $\tau \ge n$ and if $\tilde{W}_n \ne W_n$, then $\tau = n$. Thus, $\tau$ defined above also
satisfies $\tau = \min\{n \ge 0 : |\tilde{S}_n| \ge c\}$. Let $\tilde{\sigma}^2 = \mathrm{Var}(\tilde{W}_i)$. Let $\tilde{S}_n = \tilde{W}_1 + \cdots + \tilde{W}_n$ for $n \ge 0$ and let
$\tilde{M}_n = \tilde{S}_n^2 - n\tilde{\sigma}^2$. Show that $\tilde{M}$ is a martingale. Hence, $E[\tilde{M}_{\tau \wedge n}] = 0$ for all $n$. Conclude that
$E[\tau] < \infty$.

(c) Show that $E[S_\tau] = 0$. (Hint: Use part (b) and invoke a version of the optional stopping
theorem.)

10.7 Bounding the value of a game 

Consider the following game. Initially a jar has $a$ red marbles and $b$ blue marbles. On each turn,
the player removes a set of marbles, consisting of either one or two marbles of the same color, and
then flips a fair coin. If heads appears on the coin, then if one marble was removed, one of each
color is added to the jar, and if two marbles were removed, then three marbles of the other color
are added back to the jar. If tails appears, no marbles are added back to the jar. The turn is then
over. Play continues until the jar is empty after a turn, and then the game ends. Let $\tau$ be the
number of turns in the game. The goal of the player is to minimize $E[\tau]$. A strategy is a rule to
decide what set of marbles to remove at the beginning of each turn.

(a) Find a lower bound on $E[\tau]$ that holds no matter what strategy the player selects.

(b) Suggest a strategy that approximately minimizes $E[\tau]$, and for that strategy, find an upper
bound on $E[\tau]$.

10.8 On the size of a maximum matching in a random bipartite graph 

Given $1 \le d < n$, let $U = \{u_1, \ldots, u_n\}$ and $V = \{v_1, \ldots, v_n\}$ be disjoint sets of cardinality $n$, and let
$G$ be a bipartite random graph with vertex set $U \cup V$, such that if $V_i$ denotes the set of neighbors
of $u_i$, then $V_1, \ldots, V_n$ are independent, and each is uniformly distributed over the set of all $\binom{n}{d}$
subsets of $V$ of cardinality $d$. A matching for $G$ is a subset of edges $M$ such that no two edges in
$M$ have a common vertex. Let $Z$ denote the maximum of the cardinalities of the matchings for $G$.

(a) Find bounds $a$ and $b$, with $0 < a < b < n$, so that $a \le E[Z] \le b$.

(b) Give an upper bound on $P\{|Z - E[Z]| \ge \gamma\sqrt{n}\}$, for $\gamma > 0$, showing that for fixed $d$, the
distribution of $Z$ is concentrated about its mean as $n \to \infty$.

(c) Suggest a greedy algorithm for finding a large cardinality matching. 

10.9 * Equivalence of having the form g(Y) and being measurable relative to the sigma 
algebra generated by Y. 

Let $Y$ and $Z$ be random variables on the same probability space. The purpose of this problem is to
establish that $Z = g(Y)$ for some Borel measurable function $g$ if and only if $Z$ is $\sigma(Y)$ measurable.
("only if" part) Suppose $Z = g(Y)$ for a Borel measurable function $g$, and let $c \in \mathbb{R}$. It
must be shown that $\{Z \le c\} \in \sigma(Y)$. Since $g$ is a Borel measurable function, by definition,
$A = \{y : g(y) \le c\}$ is a Borel subset of $\mathbb{R}$. (a) Show that $\{Z \le c\} = \{Y \in A\}$. (b) Using the
definition of Borel sets, show that $\{Y \in A\} \in \sigma(Y)$ for any Borel set $A$. The "only if" part follows.
("if" part) Suppose $Z$ is $\sigma(Y)$ measurable. It must be shown that $Z$ has the form $g(Y)$ for
some Borel measurable function $g$. (c) Prove this first in the special case that $Z$ has the form of
an indicator function: $Z = I_B$, for some event $B$, which satisfies $B \in \sigma(Y)$. (Hint: Appeal to the
definition of $\sigma(Y)$.) (d) Prove the "if" part in general. (Hint: $Z$ can be written as the supremum
of a countable set of random variables, with each being a constant times an indicator function:
Z = sup n qnl{z<q n \i where q±,q2, ■ ■ ■ is an enumeration of the set of rational numbers.) 

10.10 * Regular conditional distributions 

Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$ and let $\mathcal{D}$ be a sub-$\sigma$-algebra of $\mathcal{F}$. A conditional
probability such as $P(X \le c|\mathcal{D})$ for a fixed constant $c$ can sometimes have different versions, but any
two such versions are equal with probability one. Roughly speaking, the idea of regular conditional
distributions, defined next, is to select a version of $P(X \le c|\mathcal{D})$ for every real number $c$ so that,
as a function of $c$ for $\omega$ fixed, the result is a valid CDF (i.e. nondecreasing, right-continuous, with
limit zero at $-\infty$ and limit one at $+\infty$). The difficulty is that there are uncountably many choices
of $c$. Here is the definition. A regular conditional CDF of $X$ given $\mathcal{D}$, denoted by $F_{X|\mathcal{D}}(c|\omega)$, is a
function of $(c, \omega) \in \mathbb{R} \times \Omega$ such that:

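(1) for each $c \in \mathbb{R}$ fixed, $F_{X|\mathcal{D}}(c|\omega)$ is a $\mathcal{D}$ measurable function of $\omega$,

(2) for each $\omega$ fixed, as a function of $c$, $F_{X|\mathcal{D}}(c|\omega)$ is a valid CDF,

(3) for any $c \in \mathbb{R}$, $F_{X|\mathcal{D}}(c|\omega)$ is a version of $P(X \le c|\mathcal{D})$.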

The purpose of this problem is to prove the existence of a regular conditional CDF. For each
rational number $q$, let $\Phi(q) = P(X \le q|\mathcal{D})$. That is, for each rational number $q$, we pick $\Phi(q)$ to be
one particular version of $P(X \le q|\mathcal{D})$. Thus, $\Phi(q)$ is a random variable, and so we can also write
it as $\Phi(q, \omega)$ to make explicit the dependence on $\omega$. By the positivity preserving property of
conditional expectations, $P\{\Phi(q) > \Phi(q')\} = 0$ if $q < q'$. Let $\{q_1, q_2, \ldots\}$ denote the set of rational
numbers, listed in some order. The event $N$ defined by

$$N = \bigcup_{n,m \,:\, q_n < q_m} \{\Phi(q_n) > \Phi(q_m)\}$$

is a countable union of events of probability zero, and thus has probability zero. Modify $\Phi(q, \omega)$ for
$\omega \in N$ by letting $\Phi(q, \omega) = F_o(q)$ for $\omega \in N$ and all rational $q$, where $F_o$ is an arbitrary, fixed CDF.
Then for any $c \in \mathbb{R}$ and $\omega \in \Omega$, let

$$\Phi(c, \omega) = \inf_{q > c} \Phi(q, \omega).$$

Show that $\Phi$ so defined is a regular conditional CDF of $X$ given $\mathcal{D}$.

10.11 * An even more general definition of conditional expectation, and the conditional
version of Jensen's inequality

Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$ and let $\mathcal{D}$ be a sub-$\sigma$-algebra of $\mathcal{F}$. Let $F_{X|\mathcal{D}}(c|\omega)$ be a
regular conditional CDF of $X$ given $\mathcal{D}$. Then for each $\omega$, we can define $E[X|\mathcal{D}]$ at $\omega$ to equal the
mean for the CDF $\{F_{X|\mathcal{D}}(c|\omega) : c \in \mathbb{R}\}$, which is contained in the extended real line $\mathbb{R} \cup \{-\infty, +\infty\}$.
Symbolically: $E[X|\mathcal{D}](\omega) = \int_{\mathbb{R}} c \, F_{X|\mathcal{D}}(dc|\omega)$. Show that, in the special case that $E[|X|] < \infty$, this
definition is consistent with the one given previously. As an application, the following conditional
version of Jensen's inequality holds: If $\varphi$ is a convex function on $\mathbb{R}$, then $E[\varphi(X)|\mathcal{D}] \ge \varphi(E[X|\mathcal{D}])$
a.s. The proof is given by applying the ordinary Jensen's inequality for each $\omega$ fixed, for the regular
conditional CDF of $X$ given $\mathcal{D}$ evaluated at $\omega$.

Chapter 11

Appendix 

11.1 Some notation 

The following notational conventions are used in these notes.

    $A^c$ = complement of $A$
    $AB$ = $A \cap B$
    $A \subset B$ means any element of $A$ is also an element of $B$
    $A - B$ = $AB^c$
    $\bigcup_{i=1}^{\infty} A_i$ = $\{a : a \in A_i \text{ for some } i\}$
    $\bigcap_{i=1}^{\infty} A_i$ = $\{a : a \in A_i \text{ for all } i\}$
    $a \vee b$ = $\max\{a, b\}$ = $a$ if $a \ge b$, and $b$ if $a < b$
    $a \wedge b$ = $\min\{a, b\}$
    $a_+$ = $a \vee 0$ = $\max\{a, 0\}$
    $I_A(x)$ = 1 if $x \in A$, and 0 else
    $(a, b) = \{x : a < x < b\}$        $(a, b] = \{x : a < x \le b\}$
    $[a, b) = \{x : a \le x < b\}$        $[a, b] = \{x : a \le x \le b\}$
    $\mathbb{Z}$ = set of integers
    $\mathbb{Z}_+$ = set of nonnegative integers
    $\mathbb{R}$ = set of real numbers
    $\mathbb{R}_+$ = set of nonnegative real numbers
    $\mathbb{C}$ = set of complex numbers
    $A_1 \times \cdots \times A_n = \{(a_1, \ldots, a_n) : a_i \in A_i \text{ for } 1 \le i \le n\}$
    $A^n = A \times \cdots \times A$ ($n$ times)
    $\lfloor t \rfloor$ = greatest integer $n$ such that $n \le t$
    $\lceil t \rceil$ = least integer $n$ such that $n \ge t$
    $A \triangleq$ expression, denotes that $A$ is defined by the expression

All the trigonometric identities required in these notes can be easily derived from the two 
identities: 

$\cos(a + b) = \cos(a)\cos(b) - \sin(a)\sin(b)$
$\sin(a + b) = \sin(a)\cos(b) + \cos(a)\sin(b)$

and the facts $\cos(-a) = \cos(a)$ and $\sin(-b) = -\sin(b)$.

A set of numbers is countably infinite if the numbers in the set can be listed in a sequence
$x_i : i = 1, 2, \ldots$. For example, the set of rational numbers is countably infinite, but the set of all
real numbers in any interval of positive length is not countably infinite.

11.2 Convergence of sequences of numbers 

We begin with some basic definitions. Let $(x_n) = (x_1, x_2, \ldots)$ and $(y_n) = (y_1, y_2, \ldots)$ be sequences
of numbers and let $x$ be a number. By definition, $x_n$ converges to $x$ as $n$ goes to infinity if for each
$\epsilon > 0$ there is an integer $n_\epsilon$ so that $|x_n - x| \le \epsilon$ for every $n \ge n_\epsilon$. We write $\lim_{n\to\infty} x_n = x$ to
denote that $x_n$ converges to $x$.

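Example 11.2.1 Let $x_n = \frac{2n+4}{n^2+1}$. Let us verify that $\lim_{n\to\infty} x_n = 0$. The inequality $|x_n| \le \epsilon$
holds if $2n + 4 \le \epsilon(n^2 + 1)$. Therefore it holds if $2n + 4 \le \epsilon n^2$. Therefore it holds if both $2n \le \frac{\epsilon}{2} n^2$
and $4 \le \frac{\epsilon}{2} n^2$, that is, if both $n \ge 4/\epsilon$ and $n \ge \sqrt{8/\epsilon}$. So if
$n_\epsilon = \lceil \max\{4/\epsilon, \sqrt{8/\epsilon}\} \rceil$, then $n \ge n_\epsilon$ implies that $|x_n| \le \epsilon$. So $\lim_{n\to\infty} x_n = 0$.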
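The argument of Example 11.2.1 can be spot-checked numerically. The sketch below is only an illustration, not part of the proof: it takes the threshold $\lceil \max\{4/\epsilon, \sqrt{8/\epsilon}\} \rceil$ that follows from the two inequalities $2n \le \frac{\epsilon}{2}n^2$ and $4 \le \frac{\epsilon}{2}n^2$, and checks the bound $|x_n| \le \epsilon$ over a finite window of indices. The function names and the window length `horizon` are assumptions made for this illustration.

```python
import math


def x(n: int) -> float:
    """Term of the sequence x_n = (2n + 4) / (n^2 + 1)."""
    return (2 * n + 4) / (n * n + 1)


def n_eps(eps: float) -> int:
    """Threshold ceil(max(4/eps, sqrt(8/eps))) derived in the example."""
    return math.ceil(max(4 / eps, math.sqrt(8 / eps)))


def bound_holds(eps: float, horizon: int = 1000) -> bool:
    """Finite spot check: |x_n| <= eps for n_eps <= n < n_eps + horizon."""
    start = n_eps(eps)
    return all(abs(x(n)) <= eps for n in range(start, start + horizon))


if __name__ == "__main__":
    for eps in (1.0, 0.1, 0.01):
        print(eps, n_eps(eps), bound_holds(eps))
```

A finite check of this kind cannot replace the proof, since convergence concerns all sufficiently large $n$, but it is a quick way to catch an algebra slip in a candidate threshold.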



By definition, $(x_n)$ converges to $+\infty$ as $n$ goes to infinity if for every $K > 0$ there is an integer
$n_K$ so that $x_n \ge K$ for every $n \ge n_K$. Convergence to $-\infty$ is defined in a similar way.$^1$ For
example, $n^3 \to \infty$ as $n \to \infty$ and $n^3 - 2n^4 \to -\infty$ as $n \to \infty$.

Occasionally a two-dimensional array of numbers $(a_{m,n} : m \ge 1, n \ge 1)$ is considered. By
definition, $a_{m,n}$ converges to a number $a^*$ as $m$ and $n$ jointly go to infinity if for each $\epsilon > 0$ there
is $n_\epsilon > 0$ so that $|a_{m,n} - a^*| < \epsilon$ for every $m, n \ge n_\epsilon$. We write $\lim_{m,n\to\infty} a_{m,n} = a^*$ to denote that
$a_{m,n}$ converges to $a^*$ as $m$ and $n$ jointly go to infinity.

Theoretical Exercise Let $a_{m,n} = 1$ if $m = n$ and $a_{m,n} = 0$ if $m \ne n$. Show that $\lim_{n\to\infty} (\lim_{m\to\infty} a_{m,n}) =
\lim_{m\to\infty} (\lim_{n\to\infty} a_{m,n}) = 0$ but that $\lim_{m,n\to\infty} a_{m,n}$ does not exist.



$^1$ Some authors reserve the word "convergence" for convergence to a finite limit. When we say a sequence converges
to $+\infty$ some would say the sequence diverges to $+\infty$.

Theoretical Exercise Let $a_{m,n} = \frac{(-1)^{m+n}}{\min(m,n)}$. Show that $\lim_{m\to\infty} a_{m,n}$ does not exist for any $n$
and $\lim_{n\to\infty} a_{m,n}$ does not exist for any $m$, but $\lim_{m,n\to\infty} a_{m,n} = 0$.

Theoretical Exercise If $\lim_{m,n\to\infty} a_{m,n} = a^*$ and $\lim_{m\to\infty} a_{m,n} = b_n$ for each $n$, then $\lim_{n\to\infty} b_n =
a^*$.

The condition $\lim_{m,n\to\infty} a_{m,n} = a^*$ can be expressed in terms of convergence of sequences
depending on only one index (as can all the other limits discussed in these notes) as follows. Namely,
$\lim_{m,n\to\infty} a_{m,n} = a^*$ is equivalent to the following: $\lim_{k\to\infty} a_{m_k,n_k} = a^*$ whenever $((m_k, n_k) : k \ge 1)$
is a sequence of pairs of positive integers such that $m_k \to \infty$ and $n_k \to \infty$ as $k \to \infty$. The condition
that the limit $\lim_{m,n\to\infty} a_{m,n}$ exists is equivalent to the condition that the limit $\lim_{k\to\infty} a_{m_k,n_k}$
exists whenever $((m_k, n_k) : k \ge 1)$ is a sequence of pairs of positive integers such that $m_k \to \infty$
and $n_k \to \infty$ as $k \to \infty$.$^2$

A sequence $a_1, a_2, \ldots$ is said to be nondecreasing if $a_i \le a_j$ for $i < j$. Similarly a function $f$
on the real line is nondecreasing if $f(x) \le f(y)$ whenever $x < y$. The sequence is called strictly
increasing if $a_i < a_j$ for $i < j$ and the function is called strictly increasing if $f(x) < f(y)$ whenever
$x < y$.$^3$ A strictly increasing or strictly decreasing sequence is said to be strictly monotone, and a
nondecreasing or nonincreasing sequence is said to be monotone.

The sum of an infinite sequence is defined to be the limit of the partial sums. That is, by
definition,

$$\sum_{k=1}^{\infty} y_k = x \quad \text{means that} \quad \lim_{n\to\infty} \sum_{k=1}^{n} y_k = x.$$

Often we want to show that a sequence converges even if we don't explicitly know the value of the
limit. A sequence $(x_n)$ is bounded if there is a number $L$ so that $|x_n| \le L$ for all $n$. Any sequence
that is bounded and monotone converges to a finite number.

Example 11.2.2 Consider the sum $\sum_{k=1}^{\infty} k^{-\alpha}$ for a constant $\alpha > 1$. For each $n$ the $n$th partial
sum can be bounded by comparison to an integral, based on the fact that for $k \ge 2$, the $k$th term
of the sum is less than the integral of $x^{-\alpha}$ over the interval $[k - 1, k]$:

$$\sum_{k=1}^{n} k^{-\alpha} \le 1 + \int_1^n x^{-\alpha} dx = 1 + \frac{1 - n^{1-\alpha}}{\alpha - 1} \le 1 + \frac{1}{\alpha - 1} = \frac{\alpha}{\alpha - 1}.$$

The partial sums are also monotone nondecreasing (in fact, strictly increasing). Therefore the sum
$\sum_{k=1}^{\infty} k^{-\alpha}$ exists and is finite.
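As a numerical illustration of Example 11.2.2, the following sketch compares partial sums of $\sum_k k^{-\alpha}$ with the integral-comparison bound $\alpha/(\alpha - 1)$. The function names are chosen for this illustration only; the code does not prove convergence, it just shows the partial sums increasing while staying below the bound.

```python
def partial_sum(alpha: float, n: int) -> float:
    """n-th partial sum of the series sum_{k >= 1} k^(-alpha)."""
    return sum(k ** (-alpha) for k in range(1, n + 1))


def integral_bound(alpha: float) -> float:
    """Upper bound alpha / (alpha - 1) on every partial sum, valid for alpha > 1."""
    if alpha <= 1:
        raise ValueError("the bound requires alpha > 1")
    return alpha / (alpha - 1)


if __name__ == "__main__":
    for alpha in (1.5, 2.0, 3.0):
        sums = [round(partial_sum(alpha, n), 6) for n in (10, 100, 1000)]
        print(alpha, sums, integral_bound(alpha))
```

The bound is not tight (for $\alpha = 2$ the sum is $\pi^2/6 \approx 1.645$ while the bound is 2), but boundedness together with monotonicity is all the example needs.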



$^2$ We could add here the condition that the limit should be the same for all choices of sequences, but it is
automatically true. If two sequences were to yield different limits of $a_{m_k,n_k}$, a third sequence could be constructed by
interleaving the first two, and $a_{m_k,n_k}$ wouldn't be convergent for that sequence.

$^3$ We avoid simply saying "increasing," because for some authors it means strictly increasing and for other authors
it means nondecreasing. While inelegant, our approach is safer.

A sequence $(x_n)$ is a Cauchy sequence if $\lim_{m,n\to\infty} |x_m - x_n| = 0$. It is not hard to show that
if $x_n$ converges to a finite limit $x$ then $(x_n)$ is a Cauchy sequence. More useful is the converse
statement, called the Cauchy criterion for convergence, or the completeness property of $\mathbb{R}$: If $(x_n)$
is a Cauchy sequence then $x_n$ converges to a finite limit as $n$ goes to infinity.

Example 11.2.3 Suppose $(x_n : n \ge 1)$ is a sequence such that $\sum_{i=1}^{\infty} |x_{i+1} - x_i| < \infty$. The Cauchy
criterion can be used to show that the sequence $(x_n : n \ge 1)$ is convergent. Suppose $1 \le m < n$.
Then by the triangle inequality for absolute values:

$$|x_n - x_m| \le \sum_{i=m}^{n-1} |x_{i+1} - x_i|$$

or, equivalently,

$$|x_n - x_m| \le \left| \sum_{i=1}^{n-1} |x_{i+1} - x_i| - \sum_{i=1}^{m-1} |x_{i+1} - x_i| \right|. \qquad (11.1)$$

Inequality (11.1) also holds if $1 \le n \le m$. By the definition of the sum $\sum_{i=1}^{\infty} |x_{i+1} - x_i|$, both sums
on the right side of (11.1) converge to $\sum_{i=1}^{\infty} |x_{i+1} - x_i|$ as $m, n \to \infty$, so the right side of (11.1)
converges to zero as $m, n \to \infty$. Thus, $(x_n)$ is a Cauchy sequence, and it is hence convergent.



Theoretical Exercise 

1. Show that if $\lim_{n\to\infty} x_n = x$ and $\lim_{n\to\infty} y_n = y$ then $\lim_{n\to\infty} x_n y_n = xy$.

2. Find the limits and prove convergence as $n \to \infty$ for the following sequences:

(a) x n = n2+1 , (b) y n = Y^ppi (c) z„ = 2^ A , =2 fciogfc 

The minimum of a set of numbers, $A$, written $\min A$, is the smallest number in the set, if there
is one. For example, $\min\{3, 5, 19, -2\} = -2$. Of course, $\min A$ is well defined if $A$ is finite (i.e. has
finite cardinality). Some sets fail to have a minimum, for example neither $\{1, 1/2, 1/3, 1/4, \ldots\}$ nor
$\{0, -1, -2, \ldots\}$ has a smallest number. The infimum of a set of numbers $A$, written $\inf A$, is the
greatest lower bound for $A$. If $A$ is bounded below, then $\inf A = \max\{c : c \le a \text{ for all } a \in A\}$. For
example, $\inf\{1, 1/2, 1/3, 1/4, \ldots\} = 0$. If there is no finite lower bound, the infimum is $-\infty$. For
example, $\inf\{0, -1, -2, \ldots\} = -\infty$. By convention, the infimum of the empty set is $+\infty$. With
these conventions, if $A \subset B$ then $\inf A \ge \inf B$. The infimum of any subset of $\mathbb{R}$ exists, and if $\min A$
exists, then $\min A = \inf A$, so the notion of infimum extends the notion of minimum to all subsets
of $\mathbb{R}$.

Similarly, the maximum of a set of numbers $A$, written $\max A$, is the largest number in the set,
if there is one. The supremum of a set of numbers $A$, written $\sup A$, is the least upper bound for
$A$. We have $\sup A = -\inf\{-a : a \in A\}$. In particular, $\sup A = +\infty$ if $A$ is not bounded above, and
$\sup \emptyset = -\infty$. The supremum of any subset of $\mathbb{R}$ exists, and if $\max A$ exists, then $\max A = \sup A$,
so the notion of supremum extends the notion of maximum to all subsets of $\mathbb{R}$.
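The conventions for the empty set and the identity $\sup A = -\inf\{-a : a \in A\}$ can be mirrored in a few lines of code. The sketch below is an illustration restricted to finite collections, where the infimum coincides with the minimum whenever the collection is nonempty; the helper names `infimum` and `supremum` are assumptions of this illustration.

```python
import math
from typing import Iterable


def infimum(values: Iterable[float]) -> float:
    """Infimum of a finite collection; +infinity for the empty set, by convention."""
    return min(values, default=math.inf)


def supremum(values: Iterable[float]) -> float:
    """Supremum computed through sup A = -inf{-a : a in A}; -infinity for the empty set."""
    return -infimum(-a for a in values)


if __name__ == "__main__":
    A = [3, 5]
    B = [3, 5, 19, -2]
    print(infimum(B), supremum(B))        # min and max of a finite set
    print(infimum([]), supremum([]))      # conventions for the empty set
    print(infimum(A) >= infimum(B))       # A is a subset of B, so inf A >= inf B
```

With the empty-set conventions, the monotonicity property $\inf A \ge \inf B$ for $A \subset B$ holds with no special case for $A = \emptyset$, which is one reason the conventions are chosen this way.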
The notions of infimum and supremum of a set of numbers are useful because they exist for any
set of numbers. There is a pair of related notions that generalizes the notion of limit. Not every
sequence has a limit, but the following terminology is useful for describing the limiting behavior of
a sequence, whether or not the sequence has a limit.

Definition 11.2.4 The liminf (also called limit inferior) of a sequence $(x_n : n \ge 1)$ is defined by

$$\liminf_{n\to\infty} x_n = \lim_{n\to\infty} \left[ \inf\{x_k : k \ge n\} \right], \qquad (11.2)$$

and the limsup (also called limit superior) is defined by

$$\limsup_{n\to\infty} x_n = \lim_{n\to\infty} \left[ \sup\{x_k : k \ge n\} \right]. \qquad (11.3)$$

The possible values of the liminf and limsup of a sequence are $\mathbb{R} \cup \{-\infty, +\infty\}$.

The limit on the right side of (11.2) exists because the infimum inside the square brackets is 
monotone nondecreasing in n. Similarly, the limit on the right side of (11.3) exists. So every 
sequence of numbers has a liminf and limsup. 

Definition 11.2.5 A subsequence of a sequence $(x_n : n \ge 1)$ is a sequence of the form $(x_{k_i} : i \ge 1)$,
where $k_1, k_2, \ldots$ is a strictly increasing sequence of integers. The set of limit points of a sequence is
the set of all limits of convergent subsequences. The values $-\infty$ and $+\infty$ are possible limit points.

Example 11.2.6 Suppose $y_n = 121 - 25n^2$ for $n \le 100$ and $y_n = 1/n$ for $n \ge 101$. The liminf and
limsup of a sequence do not depend on any finite number of terms of the sequence, so the values
of $y_n$ for $n \le 100$ are irrelevant. For all $n \ge 101$, $\inf\{y_k : k \ge n\} = \inf\{1/n, 1/(n + 1), \ldots\} = 0$,
which trivially converges to 0 as $n \to \infty$. So the liminf of $(y_n)$ is zero. For all $n \ge 101$, $\sup\{y_k :
k \ge n\} = \sup\{1/n, 1/(n + 1), \ldots\} = 1/n$, which also converges to 0 as $n \to \infty$. So the limsup of $(y_n)$
is also zero. Zero is also the only limit point of $(y_n)$.



Example 11.2.7 Consider the sequence of numbers $(2, -3/2, 4/3, -5/4, 6/5, \ldots)$, which we also
write as $(x_n : n \ge 1)$ such that $x_n = \frac{(-1)^{n+1}(n+1)}{n}$. The maximum (and supremum) of the sequence is
2, and the minimum (and infimum) of the sequence is $-3/2$. But for large $n$, the sequence alternates
between numbers near one and numbers near minus one. More precisely, the subsequence of odd
numbered terms, $(x_{2i-1} : i \ge 1)$, converges to 1, and the subsequence of even numbered terms,
$(x_{2i} : i \ge 1)$, has limit $-1$. Thus, both 1 and $-1$ are limit points of the sequence, and there aren't
any other limit points. The overall sequence itself does not converge (i.e. does not have a limit)
but $\liminf_{n\to\infty} x_n = -1$ and $\limsup_{n\to\infty} x_n = +1$.
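The definitions (11.2) and (11.3) can be explored numerically for the sequence of Example 11.2.7. The sketch below approximates the tail infimum and supremum of $\{x_k : k \ge n\}$ over a finite window; for this particular sequence the window values are exact, because the tail infimum is attained at the first even index and the tail supremum at the first odd index at or after $n$. The names `x`, `tail_inf_sup`, and `horizon` are assumptions of this illustration.

```python
def x(n: int) -> float:
    """x_n = (-1)^(n+1) * (n+1)/n, giving 2, -3/2, 4/3, -5/4, ..."""
    return (-1) ** (n + 1) * (n + 1) / n


def tail_inf_sup(n: int, horizon: int = 10_000):
    """Min and max of x_k over the finite window n <= k < n + horizon."""
    tail = [x(k) for k in range(n, n + horizon)]
    return min(tail), max(tail)


if __name__ == "__main__":
    for n in (1, 11, 101, 1001):
        lo, hi = tail_inf_sup(n)
        print(f"n = {n:5d}  tail inf = {lo:+.6f}  tail sup = {hi:+.6f}")
```

As $n$ grows, the tail infimum increases toward $-1$ and the tail supremum decreases toward $+1$, matching the monotonicity that guarantees the limits in (11.2) and (11.3) exist.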

Some simple facts about the limit, liminf, limsup, and limit points of a sequence are collected
in the following proposition. The proof is left to the reader.

Proposition 11.2.8 Let $(x_n : n \ge 1)$ denote a sequence of numbers.

1. The condition $\liminf_{n\to\infty} x_n \ge x_\infty$ is equivalent to the following:
for any $\gamma < x_\infty$, $x_n \ge \gamma$ for all sufficiently large $n$.

2. The condition $\limsup_{n\to\infty} x_n \le x_\infty$ is equivalent to the following:
for any $\gamma > x_\infty$, $x_n \le \gamma$ for all sufficiently large $n$.

3. $\liminf_{n\to\infty} x_n \le \limsup_{n\to\infty} x_n$.

4. $\lim_{n\to\infty} x_n$ exists if and only if the liminf equals the limsup, and if the limit exists, then the
limit, liminf, and limsup are equal.

5. $\lim_{n\to\infty} x_n$ exists if and only if the sequence has exactly one limit point, $x^*$, and if the limit
exists, it is equal to that one limit point.

6. Both the liminf and limsup of the sequence are limit points. The liminf is the smallest limit
point and the limsup is the largest limit point (keep in mind that $-\infty$ and $+\infty$ are possible
values of the liminf, limsup, or a limit point).

Theoretical Exercise 

1. Prove Proposition 11.2.8 

2. Here's a more challenging one. Let $r$ be an irrational constant, and let $x_n = nr - \lfloor nr \rfloor$ for
$n \ge 1$. Show that every point in the interval $[0, 1]$ is a limit point of $(x_n : n \ge 1)$. (P. Bohl,
W. Sierpinski, and H. Weyl independently proved a stronger result in 1909-1910: namely,
the fraction of the first $n$ values falling into a subinterval converges to the length of the
subinterval.)
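The stronger result mentioned in the second exercise can be observed empirically. The sketch below is an illustration, not a proof: it computes the fractional parts $x_k = kr - \lfloor kr \rfloor$ for a floating-point approximation of $r = \sqrt{2}$ and reports the fraction that falls in a chosen subinterval. The choice of $r$, the subinterval, and the function names are assumptions of this illustration, and a floating-point number is of course rational, so the experiment only approximates the irrational case.

```python
import math


def fractional_parts(r: float, n: int):
    """First n values x_k = k*r - floor(k*r), for k = 1, ..., n."""
    return [k * r - math.floor(k * r) for k in range(1, n + 1)]


def fraction_in(points, lo: float, hi: float) -> float:
    """Fraction of the points that fall in the subinterval [lo, hi)."""
    return sum(lo <= p < hi for p in points) / len(points)


if __name__ == "__main__":
    r = math.sqrt(2)
    for n in (100, 10_000, 100_000):
        pts = fractional_parts(r, n)
        print(n, fraction_in(pts, 0.2, 0.5))  # compare with the length 0.3
```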

11.3 Continuity of functions 

Let $f$ be a function on $\mathbb{R}^n$ for some $n$, and let $x_o \in \mathbb{R}^n$. The function has a limit $y$ at $x_o$, and such
situation is denoted by $\lim_{x \to x_o} f(x) = y$, if the following is true. Given $\epsilon > 0$, there exists $\delta > 0$ so
that $|f(x) - y| \le \epsilon$ whenever $0 < \|x - x_o\| < \delta$. This convergence condition can also be expressed
in terms of convergence of sequences, as follows. The condition $\lim_{x \to x_o} f(x) = y$ is equivalent to
the condition $f(x_n) \to y$ for any sequence $x_1, x_2, \ldots$ from $\mathbb{R}^n - \{x_o\}$ such that $x_n \to x_o$.

The function $f$ is said to be continuous at $x_o$, or equivalently, $x_o$ is said to be a continuity
point of $f$, if $\lim_{x \to x_o} f(x) = f(x_o)$. In terms of sequences, $f$ is continuous at $x_o$ if $f(x_n) \to f(x_o)$
whenever $x_1, x_2, \ldots$ is a sequence converging to $x_o$. The function $f$ is simply said to be continuous
if it is continuous at every point in $\mathbb{R}^n$.

Let $n = 1$, so consider a function $f$ on $\mathbb{R}$, and let $x_o \in \mathbb{R}$. The function has a right-hand limit
$y$ at $x_o$, and such situation is denoted by $f(x_o+) = y$ or $\lim_{x \searrow x_o} f(x) = y$, if the following is true.
Given $\epsilon > 0$, there exists $\delta > 0$ so that $|f(x) - y| \le \epsilon$ whenever $0 < x - x_o < \delta$. Equivalently,
$f(x_o+) = y$ if $f(x_n) \to y$ for any sequence $x_1, x_2, \ldots$ from $(x_o, +\infty)$ such that $x_n \to x_o$. The
left-hand limit $f(x_o-) = \lim_{x \nearrow x_o} f(x)$ is defined similarly. If $f$ is monotone nondecreasing, then
the left-hand and right-hand limits exist, and $f(x_o-) \le f(x_o) \le f(x_o+)$ for all $x_o$.

A function $f$ is called right-continuous at $x_o$ if $f(x_o) = f(x_o+)$. A function $f$ is simply called
right-continuous if it is right-continuous at all points.

Definition 11.3.1 A function $f$ on a bounded interval (open, closed, or mixed) with endpoints
$a < b$ is piecewise continuous, if there exist $n \ge 1$ and $a = t_0 < t_1 < \cdots < t_n = b$, such that, for
$1 \le k \le n$: $f$ is continuous over $(t_{k-1}, t_k)$ and has finite limits at the endpoints of $(t_{k-1}, t_k)$.
More generally, if $T$ is all of $\mathbb{R}$ or an interval in $\mathbb{R}$, $f$ is piecewise continuous over $T$ if it is
piecewise continuous over every bounded subinterval of $T$.

11.4 Derivatives of functions 

Let $f$ be a function on $\mathbb{R}$ and let $x_o \in \mathbb{R}$. Then $f$ is differentiable at $x_o$ if the following limit exists
and is finite:

$$\lim_{x \to x_o} \frac{f(x) - f(x_o)}{x - x_o}.$$

The value of the limit is the derivative of $f$ at $x_o$, written as $f'(x_o)$. In more detail, this condition
that $f$ is differentiable at $x_o$ means there is a finite value $f'(x_o)$ so that, for any $\epsilon > 0$, there exists
$\delta > 0$, so that

$$\left| \frac{f(x) - f(x_o)}{x - x_o} - f'(x_o) \right| \le \epsilon$$

whenever $0 < |x - x_o| < \delta$. Alternatively, in terms of convergence of sequences, it means there is a
finite value $f'(x_o)$ so that

$$\lim_{n\to\infty} \frac{f(x_n) - f(x_o)}{x_n - x_o} = f'(x_o)$$

whenever $(x_n : n \ge 1)$ is a sequence with values in $\mathbb{R} - \{x_o\}$ converging to $x_o$. The function $f$ is
differentiable if it is differentiable at all points.

The right-hand derivative of $f$ at a point $x_o$, denoted by $D_+f(x_o)$, is defined the same way as
$f'(x_o)$, except the limit is taken using only $x$ such that $x > x_o$. The extra condition $x > x_o$ is
indicated by using a slanting arrow in the limit notation:

$$D_+f(x_o) = \lim_{x \searrow x_o} \frac{f(x) - f(x_o)}{x - x_o}.$$

Similarly, the left-hand derivative of $f$ at a point $x_o$ is $D_-f(x_o) = \lim_{x \nearrow x_o} \frac{f(x) - f(x_o)}{x - x_o}$.

Theoretical Exercise 

1. Suppose $f$ is defined on an open interval containing $x_o$. Then $f'(x_o)$ exists if and only if
$D_+f(x_o) = D_-f(x_o)$. If $f'(x_o)$ exists then $D_+f(x_o) = D_-f(x_o) = f'(x_o)$.

We write $f''$ for the derivative of $f'$. For an integer $n \ge 0$ we write $f^{(n)}$ to denote the result of
differentiating $f$ $n$ times.

Theorem 11.4.1 (Mean value form of Taylor's theorem) Let $f$ be a function on an interval $(a, b)$
such that its $n$th derivative $f^{(n)}$ exists on $(a, b)$. Then for $a < x, x_o < b$,

$$f(x) = \sum_{k=0}^{n-1} \frac{f^{(k)}(x_o)}{k!} (x - x_o)^k + \frac{f^{(n)}(y)(x - x_o)^n}{n!}$$

for some $y$ between $x$ and $x_o$.

Clearly differentiable functions are continuous. But they can still have rather odd properties, 
as indicated by the following example. 

Example 11.4.2 Let $f(t) = t^2 \sin(1/t^2)$ for $t \ne 0$ and $f(0) = 0$. This function $f$ is a classic
example of a differentiable function with a derivative function that is not continuous. To check the
derivative at zero, note that $\left| \frac{f(s) - f(0)}{s} \right| \le |s| \to 0$ as $s \to 0$, so $f'(0) = 0$. The usual calculus can be
used to compute $f'(t)$ for $t \ne 0$, yielding

$$f'(t) = \begin{cases} 2t \sin\left(\frac{1}{t^2}\right) - \frac{2\cos\left(\frac{1}{t^2}\right)}{t} & t \ne 0 \\ 0 & t = 0 \end{cases}$$

The derivative $f'$ is not even close to being continuous at zero. As $t$ approaches zero, the cosine
term dominates, and $f'$ reaches both positive and negative values with arbitrarily large magnitude.
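The behavior described in Example 11.4.2 is easy to see numerically. The sketch below evaluates the derivative formula $2t\sin(1/t^2) - (2/t)\cos(1/t^2)$ at the points $t_k = 1/\sqrt{k\pi}$, where $\cos(1/t^2) = (-1)^k$, so the derivative is close to $-2(-1)^k\sqrt{k\pi}$. It also checks that the difference quotient at zero is bounded by $|s|$. The helper `spike_point` and the specific values of $k$ are assumptions of this illustration.

```python
import math


def f(t: float) -> float:
    """f(t) = t^2 sin(1/t^2) for t != 0, and f(0) = 0."""
    return t * t * math.sin(1.0 / (t * t)) if t != 0.0 else 0.0


def f_prime(t: float) -> float:
    """Derivative of f: 2t sin(1/t^2) - (2/t) cos(1/t^2) for t != 0, and 0 at t = 0."""
    if t == 0.0:
        return 0.0
    u = 1.0 / (t * t)
    return 2.0 * t * math.sin(u) - (2.0 / t) * math.cos(u)


def spike_point(k: int) -> float:
    """Point t_k = 1/sqrt(k*pi), at which 1/t^2 = k*pi and cos(1/t^2) = (-1)^k."""
    return 1.0 / math.sqrt(k * math.pi)


if __name__ == "__main__":
    for k in (10, 11, 1000, 1001):
        t = spike_point(k)
        print(f"t = {t:.5f}  f'(t) = {f_prime(t):+.3f}")
    for s in (1e-1, 1e-2, 1e-3):
        print(f"s = {s:g}  |f(s)/s| = {abs(f(s) / s):.3e}")
```

Consecutive values of $k$ give derivative values of opposite sign and growing magnitude as $t_k \to 0$, even though the difference quotient at zero shrinks.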

Even though the function $f$ of Example 11.4.2 is differentiable, it does not satisfy the fundamental
theorem of calculus (stated in the next section). One way to rule out the wild behavior
of Example 11.4.2 is to assume that $f$ is continuously differentiable, which means that $f$ is
differentiable and its derivative function is continuous. For some applications, it is useful to work
with functions more general than continuously differentiable ones, but for which the fundamental
theorem of calculus still holds. A possible approach is to use the following condition.

Definition 11.4.3 A function $f$ on a bounded interval (open, closed, or mixed) with endpoints
$a < b$ is continuous and piecewise continuously differentiable, if $f$ is continuous over the interval,
and if there exist $n \ge 1$ and $a = t_0 < t_1 < \cdots < t_n = b$, such that, for $1 \le k \le n$: $f$ is continuously
differentiable over $(t_{k-1}, t_k)$ and $f'$ has finite limits at the endpoints of $(t_{k-1}, t_k)$.
More generally, if $T$ is all of $\mathbb{R}$ or a subinterval of $\mathbb{R}$, then a function $f$ on $T$ is continuous and
piecewise continuously differentiable if its restriction to any bounded interval is continuous and
piecewise continuously differentiable.

Example 11.4.4 Two examples of continuous, piecewise continuously differentiable functions on
$\mathbb{R}$ are: $f(t) = \min\{t^2, 1\}$ and $g(t) = |\sin(t)|$.

Example 11.4.5 The function given in Example 11.4.2 is not considered to be piecewise continuously
differentiable because the derivative does not have finite limits at zero.



Theoretical Exercise 

1. Suppose $f$ is a continuously differentiable function on an open bounded interval $(a, b)$. Show
that if $f'$ has finite limits at the endpoints, then so does $f$.

2. Suppose $f$ is a continuous function on a closed, bounded interval $[a, b]$ such that $f'$ exists
and is continuous on the open subinterval $(a, b)$. Show that if the right-hand limit of the
derivative at $a$, $f'(a+) = \lim_{x \searrow a} f'(x)$, exists, then the right-hand derivative at $a$, defined by

$$D_+f(a) = \lim_{x \searrow a} \frac{f(x) - f(a)}{x - a}$$

also exists, and the two limits are equal.

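Let $g$ be a function from $\mathbb{R}^n$ to $\mathbb{R}^m$. Thus for each vector $x \in \mathbb{R}^n$, $g(x)$ is an $m$ vector. The
derivative matrix of $g$ at a point $x$, $\frac{\partial g}{\partial x}(x)$, is the $m \times n$ matrix with $ij$th entry $\frac{\partial g_i}{\partial x_j}(x)$. Sometimes for
brevity we write $y = g(x)$ and think of $y$ as a variable depending on $x$, and we write the derivative
matrix as $\frac{\partial y}{\partial x}(x)$.

Theorem 11.4.6 (Implicit function theorem) If $m = n$ and if $\frac{\partial y}{\partial x}$ is continuous in a neighborhood
of $x_o$ and if $\frac{\partial y}{\partial x}(x_o)$ is nonsingular, then the inverse mapping $x = g^{-1}(y)$ is defined in a neighborhood
of $y_o = g(x_o)$ and

$$\frac{\partial x}{\partial y}(y_o) = \left( \frac{\partial y}{\partial x}(x_o) \right)^{-1}.$$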

11.5 Integration 

11.5.1 Riemann integration 

Let $g$ be a bounded function on a bounded interval of the form $(a, b]$. Given:

• A partition of $(a, b]$ of the form $(t_0, t_1], (t_1, t_2], \ldots, (t_{n-1}, t_n]$, where $n \ge 0$ and
$a = t_0 < t_1 < \cdots < t_n = b$

• A sampling point from each subinterval, $v_k \in (t_{k-1}, t_k]$, for $1 \le k \le n$,

the corresponding Riemann sum for $g$ is defined by

$$\sum_{k=1}^{n} g(v_k)(t_k - t_{k-1}).$$

The norm of the partition is defined to be $\max_k |t_k - t_{k-1}|$. The Riemann integral $\int_a^b g(x)dx$ is
said to exist and its value is $I$ if the following is true. Given any $\epsilon > 0$, there is a $\delta > 0$ so that
$|\sum_{k=1}^{n} g(v_k)(t_k - t_{k-1}) - I| \le \epsilon$ whenever the norm of the partition is less than or equal to $\delta$.
This definition is equivalent to the following condition, expressed using convergence of sequences.
The Riemann integral exists and is equal to $I$, if for any sequence of partitions, specified by
$((t_1^m, t_2^m, \ldots, t_{n_m}^m) : m \ge 1)$, with corresponding sampling points $((v_1^m, \ldots, v_{n_m}^m) : m \ge 1)$, such that the
norm of the $m$th partition converges to zero as $m \to \infty$, the corresponding sequence of Riemann
sums converges to $I$ as $m \to \infty$. The function $g$ is said to be Riemann integrable over $(a, b]$ if the
integral $\int_a^b g(x)dx$ exists and is finite.
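The definition of a Riemann sum translates directly into code. The sketch below is one possible illustration under simplifying assumptions: it uses $n$ equal subintervals (so the norm of the partition is $(b-a)/n$) and draws each sampling point $v_k$ at random from $(t_{k-1}, t_k]$, to emphasize that the limit must not depend on how the sampling points are chosen. The function name `riemann_sum` and the use of a seeded `random.Random` generator are choices made for this illustration.

```python
import random
from typing import Callable


def riemann_sum(g: Callable[[float], float], a: float, b: float, n: int,
                rng: random.Random) -> float:
    """Riemann sum of g over (a, b] with n equal subintervals and one
    randomly chosen sampling point v_k in each (t_{k-1}, t_k]."""
    width = (b - a) / n
    total = 0.0
    for k in range(1, n + 1):
        t_prev, t_k = a + (k - 1) * width, a + k * width
        # 1 - rng.random() lies in (0, 1], so v_k lies in (t_prev, t_k]
        v_k = t_prev + (1.0 - rng.random()) * (t_k - t_prev)
        total += g(v_k) * (t_k - t_prev)
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (10, 100, 1000, 10_000):
        print(n, riemann_sum(lambda x: x * x, 0.0, 1.0, n, rng))  # approaches 1/3
```

For $g(x) = x^2$ on $(0, 1]$, any choice of sampling points gives an error of at most $2/n$, since $g$ varies by at most $2/n$ across each subinterval.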

Next, suppose $g$ is defined over the whole real line. If for every interval $(a, b]$, $g$ is bounded over
$[a, b]$ and Riemann integrable over $(a, b]$, then the Riemann integral of $g$ over $\mathbb{R}$ is defined by

$$\int_{-\infty}^{\infty} g(x)dx = \lim_{a,b \to \infty} \int_{-a}^{b} g(x)dx$$

provided that the indicated limit exists as $a, b$ jointly converge to $+\infty$. The values $+\infty$ or $-\infty$ are
possible.

A function that is continuous, or just piecewise continuous, is Riemann integrable over any 
bounded interval. Moreover, the following is true for Riemann integration: 

Theorem 11.5.1 (Fundamental theorem of calculus) Let $f$ be a continuously differentiable function
on $\mathbb{R}$. Then for $a < b$,

$$f(b) - f(a) = \int_a^b f'(x)dx. \qquad (11.4)$$

More generally, if $f$ is continuous and piecewise continuously differentiable, (11.4) holds with $f'(x)$
replaced by the right-hand derivative, $D_+f(x)$. (Note that $D_+f(x) = f'(x)$ whenever $f'(x)$ is defined.)
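The piecewise version of the fundamental theorem of calculus can be checked numerically on the function $f(t) = \min\{t^2, 1\}$ of Example 11.4.4, whose right-hand derivative is $D_+f(t) = 2t$ for $-1 \le t < 1$ and $0$ otherwise. The sketch below approximates $\int_a^b D_+f(x)dx$ by a midpoint Riemann sum and compares it with $f(b) - f(a)$. The helper names and the number of subintervals are assumptions of this illustration.

```python
from typing import Callable


def f(t: float) -> float:
    """f(t) = min{t^2, 1}: continuous, with kinks at t = -1 and t = 1."""
    return min(t * t, 1.0)


def d_plus_f(t: float) -> float:
    """Right-hand derivative of f: 2t on [-1, 1), and 0 elsewhere."""
    return 2.0 * t if -1.0 <= t < 1.0 else 0.0


def midpoint_integral(h: Callable[[float], float], a: float, b: float,
                      n: int = 20_000) -> float:
    """Midpoint Riemann sum of h over (a, b] with n equal subintervals."""
    width = (b - a) / n
    return width * sum(h(a + (k + 0.5) * width) for k in range(n))


if __name__ == "__main__":
    for a, b in ((0.5, 3.0), (-2.0, 0.5), (-3.0, 3.0)):
        print(a, b, f(b) - f(a), midpoint_integral(d_plus_f, a, b))
```

The jumps of $D_+f$ at $t = \pm 1$ affect at most two subintervals of the partition, so their contribution to the error vanishes as the norm of the partition goes to zero.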

We will have occasion to use Riemann integrals in two dimensions. Let $g$ be a bounded function
on a bounded rectangle of the form $(a^1, b^1] \times (a^2, b^2]$. Given:

• A partition of $(a^1, b^1] \times (a^2, b^2]$ into $n^1 \times n^2$ rectangles of the form $(t^1_{j-1}, t^1_j] \times (t^2_{k-1}, t^2_k]$, where
$n^i \ge 1$ and $a^i = t^i_0 < t^i_1 < \cdots < t^i_{n^i} = b^i$ for $i = 1, 2$

• A sampling point $(v^1_{jk}, v^2_{jk})$ inside $(t^1_{j-1}, t^1_j] \times (t^2_{k-1}, t^2_k]$ for $1 \le j \le n^1$ and $1 \le k \le n^2$,

the corresponding Riemann sum for $g$ is

$$\sum_{j=1}^{n^1} \sum_{k=1}^{n^2} g(v^1_{jk}, v^2_{jk})(t^1_j - t^1_{j-1})(t^2_k - t^2_{k-1}).$$

The norm of the partition is $\max_{i \in \{1,2\}} \max_k |t^i_k - t^i_{k-1}|$. As in the case of one dimension, $g$ is said
to be Riemann integrable over $(a^1, b^1] \times (a^2, b^2]$, and $\iint_{(a^1,b^1] \times (a^2,b^2]} g(x_1, x_2)dx_1 dx_2 = I$, if the value
of the Riemann sum converges to $I$ for any sequence of partitions and sampling point pairs, with
the norms of the partitions converging to zero.

The above definition of a Riemann sum allows the $n^1 \times n^2$ sampling points to be selected
arbitrarily from the $n^1 \times n^2$ rectangles. If, instead, the sampling points are restricted to have the
form $(v^1_j, v^2_k)$, for $n^1 + n^2$ numbers $v^1_1, \ldots, v^1_{n^1}, v^2_1, \ldots, v^2_{n^2}$, we say the corresponding Riemann sum
uses aligned sampling. We define a function $g$ on $[a, b] \times [a, b]$ to be Riemann integrable with aligned
sampling in the same way as we defined $g$ to be Riemann integrable, except the family of Riemann
sums used are the ones using aligned sampling. Since the set of sequences that must converge is
more restricted for aligned sampling, a function $g$ on $[a, b] \times [a, b]$ that is Riemann integrable is also
Riemann integrable with aligned sampling.

Proposition 11.5.2 A sufficient condition for $g$ to be Riemann integrable (and hence Riemann
integrable with aligned sampling) over $(a^1, b^1] \times (a^2, b^2]$ is that $g$ be the restriction to $(a^1, b^1] \times
(a^2, b^2]$ of a continuous function on $[a^1, b^1] \times [a^2, b^2]$. More generally, $g$ is Riemann integrable over
$(a^1, b^1] \times (a^2, b^2]$ if there is a partition of $(a^1, b^1] \times (a^2, b^2]$ into finitely many subrectangles of the form
$(t^1_{j-1}, t^1_j] \times (t^2_{k-1}, t^2_k]$, such that $g$ on $(t^1_{j-1}, t^1_j] \times (t^2_{k-1}, t^2_k]$ is the restriction to $(t^1_{j-1}, t^1_j] \times (t^2_{k-1}, t^2_k]$
of a continuous function on $[t^1_{j-1}, t^1_j] \times [t^2_{k-1}, t^2_k]$.

Proposition 11.5.2 is a standard result in real analysis. Its proof uses the fact that continuous
functions on bounded, closed sets are uniformly continuous, from which it follows that, for any
$\epsilon > 0$, there is a $\delta > 0$ so that the Riemann sums for any two partitions with norm less than or
equal to $\delta$ differ by at most $\epsilon$. The Cauchy criterion for convergence of sequences of numbers is also
used.



11.5.2 Lebesgue integration 

Lebesgue integration with respect to a probability measure is defined in the section defining the
expectation of a random variable $X$ and is written as

$$E[X] = \int_{\Omega} X(\omega)P(d\omega).$$

The idea is to first define the expectation for simple random variables, then for nonnegative random
variables, and then for general random variables by $E[X] = E[X_+] - E[X_-]$. The same approach
can be used to define the Lebesgue integral

$$\int_{-\infty}^{\infty} g(\omega)d\omega$$

for Borel measurable functions $g$ on $\mathbb{R}$. Such an integral is well defined if either $\int_{-\infty}^{\infty} g_+(\omega)d\omega < +\infty$
or $\int_{-\infty}^{\infty} g_-(\omega)d\omega < +\infty$.

11.5.3 Riemann-Stieltjes integration

Let $g$ be a bounded function on a closed interval $[a, b]$ and let $F$ be a nondecreasing function on
$[a, b]$. The Riemann-Stieltjes integral

$$\int_a^b g(x)dF(x) \qquad \text{(Riemann-Stieltjes)}$$

is defined the same way as the Riemann integral, except that the Riemann sums are changed to

$$\sum_{k=1}^{n} g(v_k)(F(t_k) - F(t_{k-1})).$$
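A Riemann-Stieltjes sum differs from an ordinary Riemann sum only in that the interval length $t_k - t_{k-1}$ is replaced by the increment $F(t_k) - F(t_{k-1})$. The sketch below illustrates this under simplifying assumptions: equal subintervals and sampling at the right endpoints, $v_k = t_k$. The function name `riemann_stieltjes_sum` and the two test integrators (a smooth $F(x) = x^2$ and a unit step at $1/2$) are choices made for this illustration.

```python
from typing import Callable


def riemann_stieltjes_sum(g: Callable[[float], float],
                          F: Callable[[float], float],
                          a: float, b: float, n: int) -> float:
    """Sum of g(v_k) * (F(t_k) - F(t_{k-1})) over n equal subintervals of (a, b],
    sampling at the right endpoints v_k = t_k."""
    width = (b - a) / n
    total = 0.0
    for k in range(1, n + 1):
        t_prev, t_k = a + (k - 1) * width, a + k * width
        total += g(t_k) * (F(t_k) - F(t_prev))
    return total


def unit_step_at_half(x: float) -> float:
    """Nondecreasing step function with a unit jump at x = 1/2."""
    return 1.0 if x >= 0.5 else 0.0


if __name__ == "__main__":
    # With F(x) = x^2, dF(x) = 2x dx, so the integral of x dF(x) over [0, 1] is 2/3.
    print(riemann_stieltjes_sum(lambda x: x, lambda x: x * x, 0.0, 1.0, 2000))
    # With a unit jump at 1/2 and continuous g, the integral picks out g(1/2).
    print(riemann_stieltjes_sum(lambda x: x, unit_step_at_half, 0.0, 1.0, 2000))
```

The step-function case shows why Stieltjes integration is convenient for CDFs with jumps: a jump of $F$ at a point contributes the value of $g$ near that point, weighted by the jump size.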

Extension of the integral over the whole real line is done as it is for Riemann integration. An
alternative definition of $\int_{-\infty}^{\infty} g(x)dF(x)$, preferred in the context of these notes, is given next.



11.5.4 Lebesgue-Stieltjes integration 

Let $F$ be a CDF. As seen in Section 1.3, there is a corresponding probability measure $P$ on the
Borel subsets of $\mathbb{R}$. Given a Borel measurable function $g$ on $\mathbb{R}$, the Lebesgue-Stieltjes integral of $g$
with respect to $F$ is defined to be the Lebesgue integral of $g$ with respect to $P$:

$$\int_{-\infty}^{\infty} g(x)dF(x) = \int_{-\infty}^{\infty} g(x)P(dx) \qquad \text{(Lebesgue)}$$

The same notation $\int_{-\infty}^{\infty} g(x)dF(x)$ is used for both Riemann-Stieltjes (RS) and Lebesgue-Stieltjes
(LS) integration. If $g$ is continuous and the LS integral is finite, then the integrals agree. In
particular, $\int_{-\infty}^{\infty} x \, dF(x)$ is identical as either an LS or RS integral. However, for equivalence of the
integrals

$$\int_{\Omega} g(X(\omega))P(d\omega) \quad \text{and} \quad \int_{-\infty}^{\infty} g(x)dF(x),$$

even for continuous functions $g$, it is essential that the integral on the right be understood as an
LS integral. Hence, in these notes, only the LS interpretation is used, and RS integration is not
needed.

If $F$ has a corresponding pdf $f$, then

$$\int_{-\infty}^{\infty} g(x)dF(x) = \int_{-\infty}^{\infty} g(x)f(x)dx \qquad \text{(Lebesgue)}$$

for any Borel measurable function $g$.

11.6 On the convergence of the mean

Suppose $(X_n : n \ge 1)$ is a sequence of random variables such that $X_n \xrightarrow{p.} X_\infty$, for some random
variable $X_\infty$. The theorems in this section address the question of whether $E[X_n] \to E[X_\infty]$. The
hypothesis $X_n \xrightarrow{p.} X_\infty$ means that for any $\epsilon > 0$ and $\delta > 0$, $P\{|X_n - X_\infty| \le \epsilon\} \ge 1 - \delta$ for all
sufficiently large $n$. Thus, the event that $X_n$ is close to $X_\infty$ has probability close to one. But the mean of $X_n$ can differ greatly
from the mean of $X_\infty$ if, in the unlikely event that $|X_n - X_\infty|$ is not small, it is very, very large.

Example 11.6.1 Suppose $U$ is a random variable with a finite mean, and suppose $A_1, A_2, \ldots$ is a
sequence of events, each with positive probability, but such that $P[A_n] \to 0$, and let $b_1, b_2, \ldots$ be a
sequence of nonzero numbers. Let $X_n = U + b_n I_{A_n}$ for $n \ge 1$. Then for any $\epsilon > 0$, $P\{|X_n - U| \ge
\epsilon\} \le P\{X_n \ne U\} = P[A_n] \to 0$ as $n \to \infty$, so $X_n \xrightarrow{p.} U$. However, $E[X_n] = E[U] + b_n P[A_n]$. Thus,
if the $b_n$ have very large magnitude, the mean $E[X_n]$ can be far larger or far smaller than $E[U]$,
for all large $n$.
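Example 11.6.1 can be made concrete with a small simulation. The sketch below uses illustrative choices that are not in the text: $U$ uniform on $[0, 1)$, $A_n = \{U < 1/n\}$ so that $P[A_n] = 1/n$, and $b_n = n^2$. Then $P\{|X_n - U| \ge \epsilon\} \le 1/n \to 0$, while $E[X_n] = E[U] + b_n P[A_n] = 1/2 + n$ grows without bound. The function names, the seed, and the number of trials are likewise assumptions of this illustration.

```python
import random


def sample_pair(n: int, rng: random.Random):
    """Draw (U, X_n) with U ~ Uniform[0, 1), A_n = {U < 1/n}, and b_n = n^2."""
    u = rng.random()
    x_n = u + (float(n ** 2) if u < 1.0 / n else 0.0)
    return u, x_n


def estimate_prob_far(n: int, eps: float = 0.1, trials: int = 20_000,
                      seed: int = 0) -> float:
    """Monte Carlo estimate of P{|X_n - U| >= eps}."""
    rng = random.Random(seed)
    far = 0
    for _ in range(trials):
        u, x_n = sample_pair(n, rng)
        far += abs(x_n - u) >= eps
    return far / trials


def exact_mean(n: int) -> float:
    """E[X_n] = E[U] + b_n * P[A_n] = 1/2 + n^2 * (1/n)."""
    return 0.5 + (n ** 2) / n


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, estimate_prob_far(n), exact_mean(n))
```

The estimated probability shrinks roughly like $1/n$ while the exact mean grows like $n$, which is the gap between convergence in probability and convergence of the means that the theorems of this section address.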

The simplest way to rule out the very, very large values of $|X_n - X_\infty|$ is to require the sequence
$(X_n)$ to be bounded. That would rule out using constants $b_n$ with arbitrarily large magnitudes
in Example 11.6.1. The following result is a good start; it is generalized to yield the dominated
convergence theorem further below.

Theorem 11.6.2 (Bounded convergence theorem) Let $X_1, X_2, \ldots$ be a sequence of random variables
such that for some finite L, $P\{|X_n| \le L\} = 1$ for all $n \ge 1$, and such that $X_n \xrightarrow{p.} X$ as
$n \to \infty$. Then $E[X_n] \to E[X]$.

Proof. For any $\epsilon > 0$, $P\{|X| \ge L + \epsilon\} \le P\{|X - X_n| \ge \epsilon\} \to 0$, so that $P\{|X| \ge L + \epsilon\} = 0$.
Since $\epsilon$ was arbitrary, $P\{|X| \le L\} = 1$. Therefore, $P\{|X - X_n| \le 2L\} = 1$ for all $n \ge 1$. Again let
$\epsilon > 0$. Then

$$|X - X_n| \le \epsilon + 2L\, I_{\{|X - X_n| \ge \epsilon\}}, \qquad (11.5)$$

so that $|E[X] - E[X_n]| = |E[X - X_n]| \le E[|X - X_n|] \le \epsilon + 2L\,P\{|X - X_n| \ge \epsilon\}$. By the hypotheses,
$P\{|X - X_n| \ge \epsilon\} \to 0$ as $n \to \infty$. Thus, for n large enough, $|E[X] - E[X_n]| < 2\epsilon$. Since $\epsilon$ is
arbitrary, $E[X_n] \to E[X]$. ■

Equation (11.5) is central to the proof just given. It bounds the difference $|X - X_n|$ by $\epsilon$ on
the event $\{|X - X_n| < \epsilon\}$, which has probability close to one for n large, and on the complement
of this event, the difference $|X - X_n|$ is still bounded so that its contribution is small for n large
enough.

The following lemma, used to establish the dominated convergence theorem, is similar to the
bounded convergence theorem, but the variables are assumed to be bounded only on one side:
specifically, the random variables are restricted to be greater than or equal to zero. The result is
that $E[X_n]$ for large n can still be much larger than $E[X_\infty]$, but cannot be much smaller. The
restriction to nonnegative $X_n$'s would rule out using negative constants $b_n$ with arbitrarily large
magnitudes in Example 11.6.1. The statement of the lemma uses "liminf," which is defined in
Appendix 11.2.

Lemma 11.6.3 (Fatou's lemma) Suppose $(X_n)$ is a sequence of nonnegative random variables such
that $X_n \xrightarrow{p.} X_\infty$. Then $\liminf_{n \to \infty} E[X_n] \ge E[X_\infty]$. (Equivalently, for any $\gamma < E[X_\infty]$, $E[X_n] \ge \gamma$
for all sufficiently large n.)

Proof. We shall prove the equivalent form of the conclusion given in the lemma, so let $\gamma$ be
any constant with $\gamma < E[X_\infty]$. By the definition of $E[X_\infty]$, there is a simple random variable Z
with $Z \le X_\infty$ such that $E[Z] \ge \gamma$. Since $Z = X_\infty \wedge Z$,

$$|X_n \wedge Z - Z| = |X_n \wedge Z - X_\infty \wedge Z| \le |X_n - X_\infty| \xrightarrow{p.} 0,$$

so that $X_n \wedge Z \xrightarrow{p.} Z$. Therefore, by the bounded convergence theorem, $\lim_{n \to \infty} E[X_n \wedge Z] = E[Z] > \gamma$.
Since $E[X_n] \ge E[X_n \wedge Z]$, it follows that $E[X_n] \ge \gamma$ for all sufficiently large n. ■

Theorem 11.6.4 (Dominated convergence theorem) If $X_1, X_2, \ldots$ is a sequence of random variables
and $X_\infty$ and Y are random variables such that the following three conditions hold:

(i) $X_n \xrightarrow{p.} X_\infty$ as $n \to \infty$

(ii) $P\{|X_n| \le Y\} = 1$ for all n

(iii) $E[Y] < +\infty$

then $E[X_n] \to E[X_\infty]$.

Proof. The hypotheses imply that $(X_n + Y : n \ge 1)$ is a sequence of nonnegative random variables
which converges in probability to $X_\infty + Y$. So Fatou's lemma implies that
$\liminf_{n \to \infty} E[X_n + Y] \ge E[X_\infty + Y]$, or equivalently, subtracting $E[Y]$ from both sides,
$\liminf_{n \to \infty} E[X_n] \ge E[X_\infty]$. Similarly, since $(-X_n + Y : n \ge 1)$ is a sequence of nonnegative random
variables which converges in probability to $-X_\infty + Y$, Fatou's lemma implies that
$\liminf_{n \to \infty} E[-X_n + Y] \ge E[-X_\infty + Y]$, or equivalently, $\limsup_{n \to \infty} E[X_n] \le E[X_\infty]$. Summarizing,

$$\limsup_{n \to \infty} E[X_n] \le E[X_\infty] \le \liminf_{n \to \infty} E[X_n].$$

In general, the liminf of a sequence is less than or equal to the limsup, and if the liminf is equal to
the limsup, then the limit exists and is equal to both the liminf and limsup. Thus, $E[X_n] \to E[X_\infty]$. ■



Corollary 11.6.5 (A consequence of integrability) If Z has a finite mean, then given any $\epsilon > 0$,
there exists a $\delta > 0$ so that if $P[A] \le \delta$, then $|E[Z I_A]| \le \epsilon$.

Proof. If not, there would exist an $\epsilon > 0$ and a sequence of events $A_n$ with $P\{A_n\} \to 0$ and
$|E[Z I_{A_n}]| > \epsilon$ for all n. But $Z I_{A_n} \xrightarrow{p.} 0$, and $Z I_{A_n}$ is dominated by the integrable random variable
$|Z|$ for all n, so the dominated convergence theorem implies that $E[Z I_{A_n}] \to 0$, which would result
in a contradiction. ■

Theoretical Exercise

1. Work backwards, and deduce the dominated convergence theorem from Corollary 11.6.5.

The following theorem is based on a different way to control the difference between $E[X_n]$
for large n and $E[X_\infty]$. Rather than a domination condition, it is assumed that the sequence is
monotone in n.

Theorem 11.6.6 (Monotone convergence theorem) Let $X_1, X_2, \ldots$ be a sequence of random variables
such that $E[X_1] > -\infty$ and such that $X_1(\omega) \le X_2(\omega) \le \cdots$. Then the limit $X_\infty$ given by
$X_\infty(\omega) = \lim_{n \to \infty} X_n(\omega)$ for all $\omega$ is an extended random variable (with possible value $\infty$) and
$E[X_n] \to E[X_\infty]$ as $n \to \infty$.

Proof. By adding $\max\{0, -X_1\}$ to all the random variables involved if necessary, we can assume
without loss of generality that $X_1, X_2, \ldots$, and therefore also $X_\infty$, are nonnegative. (The added
variable has a finite mean because $E[X_1] > -\infty$.) Write X for $X_\infty$. Recall that $E[X]$
is equal to the supremum of the expectation of simple random variables that are less than or equal
to X. So let $\gamma$ be any number such that $\gamma < E[X]$. Then, there is a simple random variable $\tilde{X}$
less than or equal to X with $E[\tilde{X}] > \gamma$. The simple random variable $\tilde{X}$ takes only finitely many
possible values. Let L be the largest. Then $\tilde{X} \le X \wedge L$, so that $E[X \wedge L] > \gamma$. By the bounded
convergence theorem, $E[X_n \wedge L] \to E[X \wedge L]$. Therefore, $E[X_n \wedge L] > \gamma$ for all large enough n.
Since $E[X_n \wedge L] \le E[X_n] \le E[X]$, it follows that $\gamma < E[X_n] \le E[X]$ for all large enough n. Since
$\gamma$ is an arbitrary constant with $\gamma < E[X]$, the desired conclusion, $E[X_n] \to E[X]$, follows. ■



11.7 Matrices 

An m × n matrix over the reals $\mathbb{R}$ has the form

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$

where $a_{ij} \in \mathbb{R}$ for all i, j. This matrix has m rows and n columns. A matrix over the complex
numbers $\mathbb{C}$ has the same form, with $a_{ij} \in \mathbb{C}$ for all i, j. The transpose of an m × n matrix $A = (a_{ij})$
is the n × m matrix $A^T = (a_{ji})$. For example

$$\begin{pmatrix} 1 & 0 & 3 \\ 2 & 1 & 1 \end{pmatrix}^T = \begin{pmatrix} 1 & 2 \\ 0 & 1 \\ 3 & 1 \end{pmatrix}.$$

The matrix A is symmetric if $A = A^T$. Symmetry requires that the matrix A be square: m = n.
The diagonal of a matrix consists of the entries of the form $a_{ii}$. A square matrix A is called
diagonal if the entries off the diagonal are zero. The n × n identity matrix is the n × n diagonal
matrix with ones on the diagonal. We write I to denote an identity matrix of some dimension n.

If A is an m × k matrix and B is a k × n matrix, then the product AB is the m × n matrix with
ij-th element $\sum_{l=1}^{k} a_{il} b_{lj}$. A vector x is an m × 1 matrix, where m is the dimension of the vector.
Thus, vectors are written in column form:

$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}$$

The set of all dimension m vectors over $\mathbb{R}$ is the m dimensional Euclidean space $\mathbb{R}^m$. The inner
product of two vectors x and y of the same dimension m is the number $x^T y$, equal to $\sum_{i=1}^{m} x_i y_i$.
The vectors x and y are orthogonal if $x^T y = 0$. The Euclidean length or norm of a vector x is given
by $\|x\| = (x^T x)^{1/2}$. A set of vectors $\varphi_1, \ldots, \varphi_n$ is orthonormal if the vectors are orthogonal to each
other and $\|\varphi_i\| = 1$ for all i.

A set of vectors $v_1, \ldots, v_n$ in $\mathbb{R}^m$ is said to span $\mathbb{R}^m$ if any vector in $\mathbb{R}^m$ can be expressed as a
linear combination $\alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_n v_n$ for some $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$. An orthonormal set of vectors
$\varphi_1, \ldots, \varphi_n$ in $\mathbb{R}^m$ spans $\mathbb{R}^m$ if and only if n = m. An orthonormal basis for $\mathbb{R}^m$ is an orthonormal
set of m vectors in $\mathbb{R}^m$. An orthonormal basis $\varphi_1, \ldots, \varphi_m$ corresponds to a coordinate system for
$\mathbb{R}^m$. Given a vector v in $\mathbb{R}^m$, the coordinates of v relative to $\varphi_1, \ldots, \varphi_m$ are given by $\alpha_i = \varphi_i^T v$.
The coordinates $\alpha_1, \ldots, \alpha_m$ are the unique numbers such that $v = \alpha_1 \varphi_1 + \cdots + \alpha_m \varphi_m$.

A square matrix U is called orthonormal if any of the following three equivalent conditions is 
satisfied: 

1. U T U = I 

2. UU T = I 

3. the columns of U form an orthonormal basis. 

Given an m × m orthonormal matrix U and a vector $v \in \mathbb{R}^m$, the coordinates of v relative to U are
given by the vector $U^T v$. Given a square matrix A, a nonzero vector $\varphi$ is an eigenvector of A and $\lambda$ is an
eigenvalue of A if the eigen relation $A\varphi = \lambda\varphi$ is satisfied.

A permutation $\pi$ of the numbers 1, ..., m is a one-to-one mapping of {1, 2, ..., m} onto itself.
That is, $(\pi(1), \ldots, \pi(m))$ is a reordering of (1, 2, ..., m). Any permutation is either even or odd.
A permutation is even if it can be obtained by an even number of transpositions of two elements.
Otherwise a permutation is odd. We write

$$(-1)^{\pi} = \begin{cases} 1 & \text{if } \pi \text{ is even} \\ -1 & \text{if } \pi \text{ is odd} \end{cases}$$

The determinant of a square matrix A, written det(A), is defined by

$$\det(A) = \sum_{\pi} (-1)^{\pi} \prod_{i=1}^{m} a_{i\pi(i)}$$

The absolute value of the determinant of a matrix A is denoted by |A|. Thus |A| = |det(A)|.
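
The permutation formula for the determinant can be evaluated directly for small matrices. The following is a minimal sketch, not an efficient method (the sum has m! terms). The function names are illustrative, and the sign of a permutation is computed by counting inversions, which has the same parity as the number of transpositions.

```python
from itertools import permutations

def sign(perm):
    """Return +1 for an even permutation and -1 for an odd one (inversion count)."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return -1 if inversions % 2 else 1

def det(a):
    """Determinant of a square matrix (given as a list of rows) by the permutation formula."""
    m = len(a)
    total = 0
    for perm in permutations(range(m)):
        term = sign(perm)
        for i in range(m):
            term *= a[i][perm[i]]  # factor a_{i, pi(i)}
        total += term
    return total

print(det([[1, 2], [3, 4]]))
```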
Some important properties of determinants are the following. Let A and B be m × m matrices.

1. If B is obtained from A by multiplication of a row or column of A by a scalar constant c,
then det(B) = c det(A).

2. If $\mathcal{U}$ is a subset of $\mathbb{R}^m$ and $\mathcal{V}$ is the image of $\mathcal{U}$ under the linear transformation determined
by A:

$$\mathcal{V} = \{Ax : x \in \mathcal{U}\}$$

then

(the volume of $\mathcal{V}$) = |A| × (the volume of $\mathcal{U}$)

3. det(AB) = det(A) det(B)

4. det(A) = det($A^T$)

5. |U| = 1 if U is orthonormal.

6. The columns of A span $\mathbb{R}^m$ if and only if det(A) ≠ 0.

7. The equation $p(\lambda) = \det(\lambda I - A)$ defines a polynomial p of degree m called the characteristic
polynomial of A.

8. The zeros $\lambda_1, \lambda_2, \ldots, \lambda_m$ of the characteristic polynomial of A, repeated according to
multiplicity, are the eigenvalues of A, and $\det(A) = \prod_{i=1}^{m} \lambda_i$. The eigenvalues can be complex
valued with nonzero imaginary parts.

If K is a symmetric m × m matrix, then the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$ are real-valued (not
necessarily distinct) and there exists an orthonormal basis consisting of the corresponding
eigenvectors $\varphi_1, \varphi_2, \ldots, \varphi_m$. Let U be the orthonormal matrix with columns $\varphi_1, \ldots, \varphi_m$ and let $\Lambda$ be
the diagonal matrix with diagonal entries given by the eigenvalues

$$\Lambda = \begin{pmatrix} \lambda_1 & & & 0 \\ & \lambda_2 & & \\ & & \ddots & \\ 0 & & & \lambda_m \end{pmatrix}$$

Then the relations among the eigenvalues and eigenvectors may be written as $KU = U\Lambda$. Therefore
$K = U \Lambda U^T$ and $\Lambda = U^T K U$. A symmetric m × m matrix A is positive semidefinite if $\alpha^T A \alpha \ge 0$
for all m-dimensional vectors $\alpha$. A symmetric matrix is positive semidefinite if and only if its
eigenvalues are nonnegative.

The remainder of this section deals with matrices over $\mathbb{C}$. The Hermitian transpose of a matrix
A is the matrix $A^*$, obtained from $A^T$ by taking the complex conjugate of each element of $A^T$. For
example,

$$\begin{pmatrix} 1 & 0 & 3 + 2j \\ 2 & j & 1 \end{pmatrix}^* = \begin{pmatrix} 1 & 2 \\ 0 & -j \\ 3 - 2j & 1 \end{pmatrix}.$$

The set of all dimension m vectors over $\mathbb{C}$ is the m-complex dimensional space $\mathbb{C}^m$. The inner
product of two vectors x and y of the same dimension m is the complex number $y^* x$, equal to
$\sum_{i=1}^{m} x_i y_i^*$. The vectors x and y are orthogonal if $x^* y = 0$. The length or norm of a vector x is
given by $\|x\| = (x^* x)^{1/2}$. A set of vectors $\varphi_1, \ldots, \varphi_n$ is orthonormal if the vectors are orthogonal to
each other and $\|\varphi_i\| = 1$ for all i.

A set of vectors $v_1, \ldots, v_n$ in $\mathbb{C}^m$ is said to span $\mathbb{C}^m$ if any vector in $\mathbb{C}^m$ can be expressed as a
linear combination $\alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_n v_n$ for some $\alpha_1, \ldots, \alpha_n \in \mathbb{C}$. An orthonormal set of vectors
$\varphi_1, \ldots, \varphi_n$ in $\mathbb{C}^m$ spans $\mathbb{C}^m$ if and only if n = m. An orthonormal basis for $\mathbb{C}^m$ is an orthonormal
set of m vectors in $\mathbb{C}^m$. An orthonormal basis $\varphi_1, \ldots, \varphi_m$ corresponds to a coordinate system for
$\mathbb{C}^m$. Given a vector v in $\mathbb{C}^m$, the coordinates of v relative to $\varphi_1, \ldots, \varphi_m$ are given by $\alpha_i = \varphi_i^* v$.
The coordinates $\alpha_1, \ldots, \alpha_m$ are the unique numbers such that $v = \alpha_1 \varphi_1 + \cdots + \alpha_m \varphi_m$.

A square matrix U over C is called unitary (rather than orthonormal) if any of the following 
three equivalent conditions is satisfied: 

1. U*U = I 

2. UU* = I 

3. the columns of U form an orthonormal basis. 

Given an m × m unitary matrix U and a vector $v \in \mathbb{C}^m$, the coordinates of v relative to U are
given by the vector $U^* v$. Eigenvectors, eigenvalues, and determinants of square matrices over $\mathbb{C}$
are defined just as they are for matrices over $\mathbb{R}$. The absolute value of the determinant of a matrix
A is denoted by |A|. Thus |A| = |det(A)|.

Some important properties of determinants of matrices over $\mathbb{C}$ are the following. Let A and B
be m × m matrices.

1. If B is obtained from A by multiplication of a row or column of A by a constant $c \in \mathbb{C}$, then
det(B) = c det(A).

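2. If $\mathcal{U}$ is a subset of $\mathbb{C}^m$ and $\mathcal{V}$ is the image of $\mathcal{U}$ under the linear transformation determined
by A:

$$\mathcal{V} = \{Ax : x \in \mathcal{U}\}$$

then

(the volume of $\mathcal{V}$) = $|A|^2$ × (the volume of $\mathcal{U}$)

where volume is measured in $\mathbb{R}^{2m}$, identifying each vector in $\mathbb{C}^m$ with its real and imaginary parts.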



3. det(AB) = det(A) det(B)

4. $\det{}^*(A) = \det(A^*)$, where $\det{}^*(A)$ denotes the complex conjugate of det(A).

5. |U| = 1 if U is unitary.

6. The columns of A span $\mathbb{C}^m$ if and only if det(A) ≠ 0.

7. The equation $p(\lambda) = \det(\lambda I - A)$ defines a polynomial p of degree m called the characteristic
polynomial of A.

8. The zeros $\lambda_1, \lambda_2, \ldots, \lambda_m$ of the characteristic polynomial of A, repeated according to
multiplicity, are the eigenvalues of A, and $\det(A) = \prod_{i=1}^{m} \lambda_i$. The eigenvalues can be complex
valued with nonzero imaginary parts.

A matrix K is called Hermitian symmetric if $K = K^*$. If K is a Hermitian symmetric m × m
matrix, then the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$ are real-valued (not necessarily distinct) and there
exists an orthonormal basis consisting of the corresponding eigenvectors $\varphi_1, \varphi_2, \ldots, \varphi_m$. Let U be
the unitary matrix with columns $\varphi_1, \ldots, \varphi_m$ and let $\Lambda$ be the diagonal matrix with diagonal entries
given by the eigenvalues

$$\Lambda = \begin{pmatrix} \lambda_1 & & & 0 \\ & \lambda_2 & & \\ & & \ddots & \\ 0 & & & \lambda_m \end{pmatrix}$$

Then the relations among the eigenvalues and eigenvectors may be written as $KU = U\Lambda$. Therefore
$K = U \Lambda U^*$ and $\Lambda = U^* K U$. A Hermitian symmetric m × m matrix A is positive semidefinite if
$\alpha^* A \alpha \ge 0$ for all $\alpha \in \mathbb{C}^m$. A Hermitian symmetric matrix is positive semidefinite if and only if its
eigenvalues are nonnegative.

Many questions about matrices over $\mathbb{C}$ can be addressed using matrices over $\mathbb{R}$. If Z is an m × m
matrix over $\mathbb{C}$, then Z can be expressed as Z = A + Bj, for some m × m matrices A and B over $\mathbb{R}$.
Similarly, if x is a vector in $\mathbb{C}^m$ then it can be written as x = u + jv for vectors $u, v \in \mathbb{R}^m$. Then
Zx = (Au − Bv) + j(Bu + Av). There is a one-to-one and onto mapping from $\mathbb{C}^m$ to $\mathbb{R}^{2m}$ defined
by $u + jv \to \binom{u}{v}$. Multiplication of x by the matrix Z is thus equivalent to multiplication of $\binom{u}{v}$ by
$\tilde{Z} = \begin{pmatrix} A & -B \\ B & A \end{pmatrix}$. We will show that

$$|Z|^2 = \det(\tilde{Z}) \qquad (11.6)$$

so that Property 2 for determinants of matrices over $\mathbb{C}$ follows from Property 2 for determinants
of matrices over $\mathbb{R}$.

It remains to prove (11.6). Suppose that $A^{-1}$ exists and examine the two 2m × 2m matrices

$$\begin{pmatrix} A & -B \\ B & A \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} A & 0 \\ B & A + BA^{-1}B \end{pmatrix}. \qquad (11.7)$$

The second matrix is obtained from the first by right multiplying each sub-block in the left column of
the first matrix by $A^{-1}B$, and adding the result to the right column. Equivalently, the second matrix
is obtained by right multiplying the first matrix by $\begin{pmatrix} I & A^{-1}B \\ 0 & I \end{pmatrix}$. But $\det\begin{pmatrix} I & A^{-1}B \\ 0 & I \end{pmatrix} = 1$, so
that the two matrices in (11.7) have the same determinant. Equating the determinants of the two
matrices in (11.7) yields $\det(\tilde{Z}) = \det(A)\det(A + BA^{-1}B)$. Similarly, the following four matrices
have the same determinant:

$$\begin{pmatrix} A + Bj & 0 \\ 0 & A - Bj \end{pmatrix} \quad \begin{pmatrix} A + Bj & A - Bj \\ 0 & A - Bj \end{pmatrix} \quad \begin{pmatrix} 2A & A - Bj \\ A - Bj & A - Bj \end{pmatrix} \quad \begin{pmatrix} 2A & 0 \\ A - Bj & \frac{A + BA^{-1}B}{2} \end{pmatrix} \qquad (11.8)$$

Equating the determinants of the first and last of the matrices in (11.8) yields that
$|Z|^2 = \det(Z)\det{}^*(Z) = \det(A + Bj)\det(A - Bj) = \det(A)\det(A + BA^{-1}B)$. Combining these
observations yields that (11.6) holds if $A^{-1}$ exists. Since each side of (11.6) is a continuous function
of A, (11.6) holds in general.
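
Identity (11.6) can be spot-checked numerically on a small example. The sketch below is only an illustration under stated assumptions: the 2 × 2 complex matrix Z is an arbitrary choice, the helper names are hypothetical, and the determinant is computed by the permutation formula so that the block needs no external library.

```python
from itertools import permutations

def det(a):
    """Determinant by the permutation formula (fine for very small matrices)."""
    m = len(a)
    total = 0
    for perm in permutations(range(m)):
        inversions = sum(
            1 for i in range(m) for j in range(i + 1, m) if perm[i] > perm[j]
        )
        term = -1 if inversions % 2 else 1
        for i in range(m):
            term *= a[i][perm[i]]
        total += term
    return total

def real_representation(z):
    """Build the 2m x 2m real matrix [[A, -B], [B, A]] from Z = A + Bj."""
    m = len(z)
    top = [[z[i][k].real for k in range(m)] + [-z[i][k].imag for k in range(m)]
           for i in range(m)]
    bottom = [[z[i][k].imag for k in range(m)] + [z[i][k].real for k in range(m)]
              for i in range(m)]
    return top + bottom

# Arbitrary illustrative matrix (an assumption, not taken from the notes).
Z = [[1 + 2j, 3 - 1j], [0.5j, 2 + 1j]]
lhs = abs(det(Z)) ** 2             # |Z|^2
rhs = det(real_representation(Z))  # determinant of the real 4 x 4 matrix
print(lhs, rhs)
```

The two printed values should agree up to floating point rounding, in line with (11.6).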



Chapter 12 



Solutions to Problems 



1.2 Independent vs. mutually exclusive 

(a) If E is an event independent of itself, then P(E) = P(E ∩ E) = P(E)P(E). This can happen if
P(E) = 0. If P(E) ≠ 0 then cancelling a factor of P(E) on each side yields P(E) = 1. In summary,
either P(E) = 0 or P(E) = 1.

(b) In general, we have P(A ∪ B) = P(A) + P(B) − P(AB). If the events A and B are independent,
then P(A ∪ B) = P(A) + P(B) − P(A)P(B) = 0.3 + 0.4 − (0.3)(0.4) = 0.58. On the other hand, if the
events A and B are mutually exclusive, then P(AB) = 0 and therefore P(A ∪ B) = 0.3 + 0.4 = 0.7.

(c) If P(A) = 0.6 and P(B) = 0.8, then the two events could be independent. However, if A and
B were mutually exclusive, then P(A) + P(B) = P(A ∪ B) ≤ 1, so it would not be possible for A and
B to be mutually exclusive if P(A) = 0.6 and P(B) = 0.8.

1.4 Frantic search 

Let D, T, B, and O denote the events that the glasses are in the drawer, on the table, in the
briefcase, or in the office, respectively. These four events partition the probability space.

(a) Let E denote the event that the glasses were not found in the first drawer search.

$$P(T|E) = \frac{P(TE)}{P(E)} = \frac{P(E|T)P(T)}{P(E|D)P(D) + P(E|D^c)P(D^c)} = \frac{(1)(0.06)}{(0.1)(0.9) + (1)(0.1)} = \frac{0.06}{0.19} \approx 0.315$$

(b) Let F denote the event that the glasses were not found after the first drawer search and first
table search.

$$P(B|F) = \frac{P(BF)}{P(F)} = \frac{P(F|B)P(B)}{P(F|D)P(D) + P(F|T)P(T) + P(F|B)P(B) + P(F|O)P(O)} = \frac{(1)(0.03)}{(0.1)(0.9) + (0.1)(0.06) + (1)(0.03) + (1)(0.01)} \approx 0.22$$

(c) Let G denote the event that the glasses were not found after the two drawer searches, two table
searches, and one briefcase search.

$$P(O|G) = \frac{P(OG)}{P(G)} = \frac{P(G|O)P(O)}{P(G|D)P(D) + P(G|T)P(T) + P(G|B)P(B) + P(G|O)P(O)} = \frac{(1)(0.01)}{(0.1)^2(0.9) + (0.1)^2(0.06) + (0.1)(0.03) + (1)(0.01)} \approx 0.4425$$
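
The three posterior probabilities above all follow the same pattern: weight each prior by the probability that every search so far failed, then normalize. A minimal sketch of that computation follows. The priors and the miss probability 0.1 are read off the numbers used in the solution, and the function name and dictionary layout are illustrative assumptions.

```python
# Prior probabilities for where the glasses are, as used in the solution above.
prior = {"D": 0.9, "T": 0.06, "B": 0.03, "O": 0.01}
MISS = 0.1  # probability that one search of the correct location fails

def posterior(searches):
    """Posterior over locations given failed searches.

    `searches` maps a location to the number of unsuccessful searches there.
    """
    joint = {loc: p * MISS ** searches.get(loc, 0) for loc, p in prior.items()}
    total = sum(joint.values())  # probability that all the searches failed
    return {loc: w / total for loc, w in joint.items()}

print(posterior({"D": 1})["T"])                  # part (a)
print(posterior({"D": 1, "T": 1})["B"])          # part (b)
print(posterior({"D": 2, "T": 2, "B": 1})["O"])  # part (c)
```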

1.6 Conditional probabilities basic computations of iterative decoding 

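(a) Here is one of several approaches to this problem. Note that the n pairs $(B_1, Y_1), \ldots, (B_n, Y_n)$
are mutually independent, and $\lambda_i(b_i) = P(B_i = b_i | Y_i = y_i) = \frac{q_i(y_i|b_i)}{q_i(y_i|0) + q_i(y_i|1)}$. Therefore

$$P(B = 1 | Y_1 = y_1, \ldots, Y_n = y_n) = \sum_{b_1, \ldots, b_n :\, b_1 \oplus \cdots \oplus b_n = 1} P(B_1 = b_1, \ldots, B_n = b_n | Y_1 = y_1, \ldots, Y_n = y_n) = \sum_{b_1, \ldots, b_n :\, b_1 \oplus \cdots \oplus b_n = 1} \ \prod_{i=1}^{n} \lambda_i(b_i).$$

(b) Using the definitions,

$$P(B = 1 | Z_1 = z_1, \ldots, Z_k = z_k) = \frac{p(1, z_1, \ldots, z_k)}{p(0, z_1, \ldots, z_k) + p(1, z_1, \ldots, z_k)} = \frac{\frac{1}{2}\prod_{j=1}^{k} r_j(1|z_j)}{\frac{1}{2}\prod_{j=1}^{k} r_j(0|z_j) + \frac{1}{2}\prod_{j=1}^{k} r_j(1|z_j)} = \frac{\eta}{1 + \eta}, \quad \text{where } \eta = \prod_{j=1}^{k} \frac{r_j(1|z_j)}{r_j(0|z_j)}.$$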

1.8 Blue corners 

(a) There are 24 ways to color 5 corners so that at least one face has four blue corners (there are 6
choices of the face, and for each face there are four choices for which additional corner to color blue).
Since there are $\binom{8}{5} = 56$ ways to select 5 out of 8 corners, P(B | exactly 5 corners colored blue) =
24/56 = 3/7.

(b) By counting the number of ways that B can happen for different numbers of blue corners we
find $P(B) = 6p^4(1-p)^4 + 24p^5(1-p)^3 + 24p^6(1-p)^2 + 8p^7(1-p) + p^8$.
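
The coefficients 6, 24, 24, 8, 1 in part (b) can be confirmed by brute force over all 256 colorings of the cube's corners. The sketch below is one way to do that. Representing corners as 0/1 triples and faces as sets of four corners is an illustrative modeling choice, not something taken from the problem statement.

```python
from itertools import combinations, product

# Corners of the cube as binary triples; a face fixes one coordinate to 0 or 1.
corners = list(product((0, 1), repeat=3))
faces = [frozenset(c for c in corners if c[axis] == side)
         for axis in range(3) for side in (0, 1)]

def counts_with_blue_face():
    """For each k, count the k-corner colorings that make some face entirely blue."""
    counts = {}
    for k in range(9):
        counts[k] = sum(
            1 for blue in combinations(corners, k)
            if any(face <= set(blue) for face in faces)
        )
    return counts

def prob_blue_face(p):
    """P(B) when each corner is independently blue with probability p."""
    c = counts_with_blue_face()
    return sum(c[k] * p ** k * (1 - p) ** (8 - k) for k in range(9))

print(counts_with_blue_face())
```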

1.10 Recognizing cumulative distribution functions 

(a) Valid (draw a sketch). $P\{X^2 > 5\} = P\{X < -\sqrt{5}\} + P\{X > \sqrt{5}\} = F_1(-\sqrt{5}) + 1 - F_1(\sqrt{5}) = \frac{e^{-5}}{2}$.

(b) Invalid. F(0) > 1. Another reason is that F is not nondecreasing.

(c) Invalid, not right continuous at 0.

1.12 CDF and characteristic function of a mixed type random variable 

(a) Range of X is [0, 0.5]. For 0 ≤ c ≤ 0.5, P{X ≤ c} = P{U ≤ c + 0.5} = c + 0.5. Thus,

$$F_X(c) = \begin{cases} 0 & c < 0 \\ c + 0.5 & 0 \le c \le 0.5 \\ 1 & c \ge 0.5 \end{cases}$$

(b) $\Phi_X(u) = 0.5 + \int_0^{0.5} e^{jux}\,dx = 0.5 + \frac{e^{ju/2} - 1}{ju}$

1.14 Conditional expectation for uniform density over a triangular region

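(a) The triangle has base and height one, so the area of the triangle is 0.5. Thus the joint pdf is 2
inside the triangle.

(b)

$$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dy = \begin{cases} \int_0^{x/2} 2\,dy = x & \text{if } 0 < x < 1 \\ \int_{x-1}^{x/2} 2\,dy = 2 - x & \text{if } 1 < x < 2 \\ 0 & \text{else} \end{cases}$$

(c) In view of part (b), the conditional density $f_{Y|X}(y|x)$ is not well defined unless 0 < x < 2. In
general we have

$$f_{Y|X}(y|x) = \begin{cases} \frac{2}{x} & \text{if } 0 < x \le 1 \text{ and } y \in [0, \frac{x}{2}] \\ 0 & \text{if } 0 < x \le 1 \text{ and } y \notin [0, \frac{x}{2}] \\ \frac{2}{2 - x} & \text{if } 1 < x < 2 \text{ and } y \in [x - 1, \frac{x}{2}] \\ 0 & \text{if } 1 < x < 2 \text{ and } y \notin [x - 1, \frac{x}{2}] \\ \text{not defined} & \text{if } x \le 0 \text{ or } x \ge 2 \end{cases}$$

Thus, for 0 < x ≤ 1, the conditional distribution of Y is uniform over the interval $[0, \frac{x}{2}]$. For
1 < x < 2, the conditional distribution of Y is uniform over the interval $[x - 1, \frac{x}{2}]$.

(d) Finding the midpoints of the intervals that Y is conditionally uniformly distributed over, or
integrating y against the conditional density found in part (c), yields:

$$E[Y | X = x] = \begin{cases} \frac{x}{4} & \text{if } 0 < x \le 1 \\ \frac{3x - 2}{4} & \text{if } 1 < x < 2 \\ \text{not defined} & \text{if } x \le 0 \text{ or } x \ge 2 \end{cases}$$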



1.16 Density of a function of a random variable 

(a) $P(X \ge 0.4 | X \le 0.8) = P(0.4 \le X \le 0.8 | X \le 0.8) = (0.8^2 - 0.4^2)/0.8^2 = \frac{3}{4}$.

(b) The range of Y is the interval [0, +∞). For c ≥ 0,

$$P\{-\ln(X) \le c\} = P\{\ln(X) \ge -c\} = P\{X \ge e^{-c}\} = \int_{e^{-c}}^{1} 2x\,dx = 1 - e^{-2c}$$

so

$$f_Y(c) = \begin{cases} 2\exp(-2c) & c \ge 0 \\ 0 & \text{else} \end{cases}$$

That is, Y is an exponential random variable with parameter 2.



1.18 Functions of independent exponential random variables 

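(a) Z takes values in the positive real line. So let z ≥ 0.

$$P\{Z \le z\} = P\{\min\{X_1, X_2\} \le z\} = P\{X_1 \le z \text{ or } X_2 \le z\} = 1 - P\{X_1 > z \text{ and } X_2 > z\} = 1 - P\{X_1 > z\}P\{X_2 > z\} = 1 - e^{-\lambda_1 z} e^{-\lambda_2 z} = 1 - e^{-(\lambda_1 + \lambda_2) z}$$

Differentiating yields that

$$f_Z(z) = \begin{cases} (\lambda_1 + \lambda_2) e^{-(\lambda_1 + \lambda_2) z}, & z \ge 0 \\ 0, & z < 0 \end{cases}$$

That is, Z has the exponential distribution with parameter $\lambda_1 + \lambda_2$.

(b) R takes values in the positive real line and by independence the joint pdf of $X_1$ and $X_2$ is the
product of their individual densities. So for r ≥ 0,

$$P\{R \le r\} = P\left\{\frac{X_1}{X_2} \le r\right\} = P\{X_1 \le r X_2\} = \int_0^{\infty} \int_0^{r x_2} \lambda_1 e^{-\lambda_1 x_1} \lambda_2 e^{-\lambda_2 x_2}\,dx_1\,dx_2 = \int_0^{\infty} (1 - e^{-r \lambda_1 x_2}) \lambda_2 e^{-\lambda_2 x_2}\,dx_2 = 1 - \frac{\lambda_2}{r\lambda_1 + \lambda_2}.$$

Differentiating yields that

$$f_R(r) = \begin{cases} \frac{\lambda_1 \lambda_2}{(\lambda_1 r + \lambda_2)^2}, & r \ge 0 \\ 0, & r < 0 \end{cases}$$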

1.20 Gaussians and the Q function 

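(a) Cov(3X + 2Y, X + 5Y + 10) = 3Cov(X, X) + 10Cov(Y, Y) = 3Var(X) + 10Var(Y) = 13.

(b) X + 4Y is N(0, 17), so $P\{X + 4Y \ge 2\} = P\left\{\frac{X + 4Y}{\sqrt{17}} \ge \frac{2}{\sqrt{17}}\right\} = Q\left(\frac{2}{\sqrt{17}}\right)$.

(c) X − Y is N(0, 2), so $P\{(X - Y)^2 > 9\} = P\{X - Y \ge 3 \text{ or } X - Y \le -3\} = 2P\left\{\frac{X - Y}{\sqrt{2}} \ge \frac{3}{\sqrt{2}}\right\} = 2Q\left(\frac{3}{\sqrt{2}}\right)$.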
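
To turn the answers in parts (b) and (c) into numbers, the Q function can be evaluated through the complementary error function, using the standard identity $Q(x) = \frac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$. A small sketch follows; the function and variable names are illustrative.

```python
import math

def q_function(x):
    """Q(x) = P{N(0,1) > x}, computed from the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

part_b = q_function(2 / math.sqrt(17))      # P{X + 4Y >= 2}
part_c = 2 * q_function(3 / math.sqrt(2))   # P{(X - Y)^2 > 9}
print(part_b, part_c)
```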

1.22 Working with a joint density 

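(a) The density must integrate to one, so c = 4/19.
(b)

$$f_X(x) = \begin{cases} c\int_1^2 (1 + xy)\,dy = \frac{4}{19}\left[1 + \frac{3x}{2}\right] & 2 \le x \le 3 \\ 0 & \text{else} \end{cases}$$

$$f_Y(y) = \begin{cases} c\int_2^3 (1 + xy)\,dx = \frac{4}{19}\left[1 + \frac{5y}{2}\right] & 1 \le y \le 2 \\ 0 & \text{else} \end{cases}$$

Therefore $f_{X|Y}(x|y)$ is well defined only if 1 ≤ y ≤ 2. For 1 ≤ y ≤ 2:

$$f_{X|Y}(x|y) = \begin{cases} \frac{1 + xy}{1 + \frac{5}{2}y} & 2 \le x \le 3 \\ 0 & \text{for other } x \end{cases}$$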



1.24 Density of a difference 

(a) Method 1. The joint density is the product of the marginals, and for any c ≥ 0, the probability
P{|X − Y| ≤ c} is the integral of the joint density over the region of the positive quadrant such
that {|x − y| ≤ c}, which by symmetry is one minus twice the integral of the density over the region
{y ≥ 0 and x ≥ y + c}. Thus, $P\{|X - Y| \le c\} = 1 - 2\int_0^{\infty} \exp(-\lambda(y + c))\,\lambda \exp(-\lambda y)\,dy = 1 - \exp(-\lambda c)$.

Thus, $f_Z(c) = \begin{cases} \lambda \exp(-\lambda c) & c \ge 0 \\ 0 & c < 0 \end{cases}$. That is, Z has the exponential distribution with parameter $\lambda$.

Method 2. The problem can be solved without calculation by the memoryless property of the
exponential distribution, as follows. Suppose X and Y are lifetimes of identical lightbulbs which are
turned on at the same time. One of them will burn out first. At that time, the other lightbulb will
be the same as a new lightbulb, and |X − Y| is equal to how much longer that lightbulb will last.



1.26 Some characteristic functions 

(a) Differentiation is straightforward, yielding $jEX = \Phi'(0) = 2j$ or EX = 2, and $j^2 E[X^2] =
\Phi''(0) = -14$, so $\mathrm{Var}(X) = 14 - 2^2 = 10$. In fact, this is the characteristic function of a N(2, 10)
random variable, that is, a Gaussian random variable with mean 2 and variance 10.

(b) Evaluation of the derivatives at zero requires l'Hospital's rule, and is a little tedious. A simpler
way is to use the Taylor series expansion $\exp(ju) = 1 + (ju) + (ju)^2/2! + (ju)^3/3! + \cdots$. The result
is EX = 0.5 and Var(X) = 1/12. In fact, this is the characteristic function of a U(0, 1) random
variable.

(c) Differentiation is straightforward, yielding $EX = \mathrm{Var}(X) = \lambda$. In fact, this is the characteristic
function of a $Poi(\lambda)$ random variable.

1.28 A transformation of jointly continuous random variables 

(a) We are using the mapping, from the square region {(u, v) : 0 ≤ u, v ≤ 1} in the u-v plane to
the triangular region with corners (0,0), (3,0), and (3,1) in the x-y plane, given by

$$x = 3u, \qquad y = uv.$$

The mapping is one-to-one, meaning that for any (x, y) in the range we can recover (u, v). Indeed,
the inverse mapping is given by

$$u = \frac{x}{3}, \qquad v = \frac{3y}{x}.$$

The Jacobian determinant of the transformation is

$$J(u, v) = \det\begin{pmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{pmatrix} = \det\begin{pmatrix} 3 & 0 \\ v & u \end{pmatrix} = 3u \ne 0, \quad \text{for all } (u, v) \in (0, 1)^2.$$

Therefore the required pdf is

$$f_{X,Y}(x, y) = \frac{f_{U,V}(u, v)}{|J(u, v)|} = \frac{9u^2v^2}{|3u|} = 3uv^2 = \frac{9y^2}{x}$$

within the triangle with corners (0,0), (3,0), and (3,1), and $f_{X,Y}(x, y) = 0$ elsewhere.

(b) Integrating out y from the joint pdf yields

$$f_X(x) = \begin{cases} \int_0^{x/3} \frac{9y^2}{x}\,dy = \frac{x^2}{9} & \text{if } 0 \le x \le 3 \\ 0 & \text{else} \end{cases}$$

Therefore the conditional density $f_{Y|X}(y|x)$ is well defined only if 0 < x ≤ 3. For 0 < x ≤ 3,

$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)} = \begin{cases} \frac{81y^2}{x^3} & \text{if } 0 \le y \le \frac{x}{3} \\ 0 & \text{else} \end{cases}$$


1.30 Jointly distributed variables 

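(a) $E\left[\frac{V^2}{1 + U}\right] = E[V^2]\,E\left[\frac{1}{1 + U}\right] = \int_0^{\infty} v^2 \lambda e^{-\lambda v}\,dv \int_0^1 \frac{1}{1 + u}\,du = \left(\frac{2}{\lambda^2}\right)(\ln(2)) = \frac{2\ln 2}{\lambda^2}$.

(b) $P\{U \le V\} = \int_0^1 \int_u^{\infty} \lambda e^{-\lambda v}\,dv\,du = \int_0^1 e^{-\lambda u}\,du = (1 - e^{-\lambda})/\lambda$.

(c) The support of both $f_{UV}$ and $f_{YZ}$ is the strip [0, 1] × [0, ∞), and the mapping (u, v) → (y, z)
defined by $y = u^2$ and z = uv is one-to-one. Indeed, the inverse mapping is given by $u = y^{1/2}$ and
$v = z y^{-1/2}$. The absolute value of the Jacobian determinant of the forward mapping is

$$\left|\frac{\partial(y, z)}{\partial(u, v)}\right| = \left|\det\begin{pmatrix} 2u & 0 \\ v & u \end{pmatrix}\right| = 2u^2 = 2y.$$

Thus,

$$f_{Y,Z}(y, z) = \begin{cases} \frac{\lambda}{2y} e^{-\lambda z y^{-1/2}} & (y, z) \in [0, 1] \times [0, \infty) \\ 0 & \text{otherwise.} \end{cases}$$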



2.2 The limit of the product is the product of the limits 

(a) There exists $n_1$ so large that $|y_n - y| \le 1$ for $n \ge n_1$. Thus, $|y_n| \le L$ for all n, where
$L = \max\{|y_1|, |y_2|, \ldots, |y_{n_1 - 1}|, |y| + 1\}$.

(b) Given $\epsilon > 0$, there exists $n_\epsilon$ so large that $|x_n - x| \le \frac{\epsilon}{2L}$ and $|y_n - y| \le \frac{\epsilon}{2(|x| + 1)}$ for $n \ge n_\epsilon$. Thus, for $n \ge n_\epsilon$,

$$|x_n y_n - xy| \le |(x_n - x) y_n| + |x(y_n - y)| \le |x_n - x| L + |x||y_n - y| \le \frac{\epsilon}{2} + \frac{\epsilon}{2} \le \epsilon.$$

So $x_n y_n \to xy$ as $n \to \infty$.

2.4 Limits of some deterministic series 

(a) Convergent. This is the power series expansion for $e^x$, which is everywhere convergent,
evaluated at x = 3. The value of the sum is thus $e^3$. Another way to show the series is convergent is to
notice that for n ≥ 3 the nth term can be bounded above by $\frac{3^n}{n!} = \frac{3 \cdot 3 \cdot 3}{1 \cdot 2 \cdot 3} \cdot \frac{3}{4} \cdot \frac{3}{5} \cdots \frac{3}{n} \le (4.5)\left(\frac{3}{4}\right)^{n-3}$. Thus,
the sum is bounded by a constant plus a geometric series, so it is convergent.

(b) Convergent. Let $0 < \eta < 1$. Then $\ln n < n^{\eta}$ for all large enough n. Also, n + 2 ≤ 2n for all large
enough n, and n + 5 ≥ n for all n. Therefore, the nth term in the series is bounded above, for all
sufficiently large n, by $\frac{(2n)\,n^{\eta}}{n^3} = 2n^{\eta - 2}$. Therefore, the sum in (b) is bounded above by finitely many terms
of the sum, plus $2\sum_{n=1}^{\infty} n^{\eta - 2}$, which is finite, because, for $\alpha > 1$, $\sum_{n=1}^{\infty} n^{-\alpha} < 1 + \int_1^{\infty} x^{-\alpha}\,dx = \frac{\alpha}{\alpha - 1}$,
as shown in an example in the appendix of the notes.

(c) Not convergent. Let $0 < \eta < 0.2$. Then $\log(n + 1) \le n^{\eta}$ for all n large enough, so for n large
enough the nth term in the series is greater than or equal to $n^{-5\eta}$. The series is therefore divergent.
We used the fact that $\sum_{n=1}^{\infty} n^{-\alpha}$ is infinite for any $0 \le \alpha \le 1$, because it is greater than or equal
to the integral $\int_1^{\infty} x^{-\alpha}\,dx$, which is infinite for $0 \le \alpha \le 1$.
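
The two claims made in part (a) are easy to check numerically: the partial sums approach $e^3$, and each term with n ≥ 3 sits below the geometric bound. The sketch below assumes the series starts at n = 0, as in the usual power series for $e^x$; the function names are illustrative.

```python
import math

def partial_sum(n_terms):
    """Sum of 3**n / n! for n = 0, ..., n_terms - 1."""
    total, term = 0.0, 1.0
    for n in range(n_terms):
        total += term
        term *= 3.0 / (n + 1)  # turn 3**n / n! into 3**(n+1) / (n+1)!
    return total

def term_bound_holds(n):
    """Check 3**n / n! <= 4.5 * (3/4)**(n - 3), the bound used for n >= 3."""
    return 3.0 ** n / math.factorial(n) <= 4.5 * 0.75 ** (n - 3)

print(partial_sum(60), math.exp(3))
print(all(term_bound_holds(n) for n in range(3, 40)))
```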

2.6 Convergence of sequences of random variables 

(a) The distribution of $X_n$ is the same for all n, so the sequence converges in distribution to any
random variable with the distribution of $X_1$. To check for mean square convergence, use the fact
cos(a) cos(b) = (cos(a + b) + cos(a − b))/2 to calculate that $E[X_n X_m] = \frac{1}{2}$ if n = m and $E[X_n X_m] = 0$
if n ≠ m. Therefore, $\lim_{n,m \to \infty} E[X_n X_m]$ does not exist, so the sequence $(X_n)$ does not satisfy
the Cauchy criteria for m.s. convergence, so it doesn't converge in the m.s. sense. Since it is a
bounded sequence, it therefore does not converge in the p. sense either. (Because for bounded
sequences, convergence p. implies convergence m.s.) Therefore the sequence doesn't converge in
the a.s. sense either. In summary, the sequence converges in distribution but not in the other three
senses. (Another approach is to note that the distribution of $X_n - X_{2n}$ is the same for all n, so
that the sequence doesn't satisfy the Cauchy criteria for convergence in probability.)

(b) If $\omega$ is such that $0 < \Theta(\omega) < 2\pi$, then $\left|1 - \frac{\Theta(\omega)}{\pi}\right| < 1$ so that $\lim_{n \to \infty} Y_n(\omega) = 0$ for such $\omega$.
Since $P\{0 < \Theta(\omega) < 2\pi\} = 1$, it follows that $(Y_n)$ converges to zero in the a.s. sense, and hence
also in the p. and d. senses. Since the sequence is bounded, it also converges to zero in the m.s.
sense.

2.8 Convergence of random variables on (0,1], version 2 

(a) (a.s., p., d., not m.s.) For any $\omega \in \Omega$ fixed, the deterministic sequence $X_n(\omega)$ converges to zero.
So $X_n \to 0$ a.s. The sequence thus also converges in p. and d. If the sequence converged in the
m.s. sense, the limit would also have to be zero, but

$$E[|X_n - 0|^2] = E[|X_n|^2] = \frac{1}{n^2}\int_0^1 \frac{1}{\omega}\,d\omega = +\infty \not\to 0.$$

The sequence thus does not converge in the m.s. sense.

(b) (a.s., p., d., not m.s.) For any $\omega \in \Omega$ fixed, except the single point 1 which has zero probability,
the deterministic sequence $X_n(\omega)$ converges to zero. So $X_n \to 0$ a.s. The sequence also converges
in p. and d. If the sequence converged in the m.s. sense, the limit would also have to be zero, but

$$E[|X_n - 0|^2] = E[|X_n|^2] = n^2 \int_0^1 \omega^{2n}\,d\omega = \frac{n^2}{2n + 1} \not\to 0.$$

The sequence thus does not converge in the m.s. sense.

(c) (d. only) For $\omega$ fixed and irrational, the sequence does not even come close to settling down,
so intuitively we expect the sequence does not converge in any of the three strongest senses: a.s.,
m.s., or p. To prove this, it suffices to prove that the sequence doesn't converge in p. Since the
sequence is bounded, convergence in probability is equivalent to convergence in the m.s. sense, so it
also would suffice to prove the sequence does not converge in the m.s. sense. The Cauchy criteria
for m.s. convergence would be violated if $E[(X_n - X_{2n})^2] \not\to 0$ as $n \to \infty$. By the double angle
formula, $X_{2n}(\omega) = 2\omega \sin(2\pi n\omega)\cos(2\pi n\omega)$ so that

$$E[(X_n - X_{2n})^2] = \int_0^1 \omega^2 (\sin(2\pi n\omega))^2 (1 - 2\cos(2\pi n\omega))^2\,d\omega$$

and this integral clearly does not converge to zero as $n \to \infty$. In fact, following the heuristic
reasoning below, the limit can be shown to equal $E[\sin^2(\Theta)(1 - 2\cos(\Theta))^2]/3$, where $\Theta$ is uniformly
distributed over the interval $[0, 2\pi]$. So the sequence $(X_n)$ does not converge in m.s., p., or a.s.
senses.
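
The claim that $E[(X_n - X_{2n})^2]$ stays away from zero can be checked by numerical integration. Expanding $\sin^2(\Theta)(1 - 2\cos(\Theta))^2$ into cosines shows that its mean over a period is 1, so the limiting value stated above works out to 1/3. The sketch below uses a plain midpoint rule; the grid size and the function name are illustrative choices.

```python
import math

def mean_square_gap(n, grid=200_000):
    """Midpoint-rule estimate of E[(X_n - X_2n)^2] for X_n(w) = w * sin(2*pi*n*w)."""
    h = 1.0 / grid
    total = 0.0
    for k in range(grid):
        w = (k + 0.5) * h                        # midpoint of the k-th cell
        x_n = w * math.sin(2 * math.pi * n * w)
        x_2n = w * math.sin(4 * math.pi * n * w)
        total += (x_n - x_2n) ** 2
    return total * h

print(mean_square_gap(50))
```

For moderately large n the estimate should sit close to 1/3 rather than near zero, consistent with the failure of the Cauchy criterion.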

The sequence does converge in the distribution sense. We shall give a heuristic derivation of the
limiting CDF. Note that the CDF of $X_n$ is given by

$$F_{X_n}(c) = \int_0^1 I_{\{f(\omega)\sin(2\pi n\omega) \le c\}}\,d\omega \qquad (12.1)$$

where f is the function defined by $f(\omega) = \omega$. As $n \to \infty$, the integrand in (12.1) jumps between
zero and one more and more frequently. For any small $\epsilon > 0$, we can imagine partitioning [0, 1] into
intervals of length $\epsilon$. The number of oscillations of the integrand within each interval converges to
infinity, and the factor $f(\omega)$ is roughly constant over each interval. The fraction of a small interval
for which the integrand is one nearly converges to $P\{f(\omega)\sin(\Theta) \le c\}$, where $\Theta$ is a random
variable that is uniformly distributed over the interval $[0, 2\pi]$, and $\omega$ is a fixed point in the small
interval. So the CDF of $X_n$ converges for all constants c to:

$$\int_0^1 P\{f(\omega)\sin(\Theta) \le c\}\,d\omega. \qquad (12.2)$$

(Note: The following observations can be used to make the above argument rigorous. The integrals
in (12.1) and (12.2) would be equal if f were constant within each interval of the form $(\frac{i}{n}, \frac{i+1}{n})$. If f
is continuous on [0, 1], it can be approximated by such step functions with maximum approximation
error converging to zero as $n \to \infty$. Details are left to the reader.)

2.10 Convergence of a sequence of discrete random variables 

(a) The CDF of $X_n$ is shown in Figure 12.1. Since $F_n(x) = F_X(x - \frac{1}{n})$ it follows that
$\lim_{n\to\infty} F_n(x) = F_X(x-)$ for all $x$. So $\lim_{n\to\infty} F_n(x) = F_X(x)$ unless $F_X(x) \ne F_X(x-)$, i.e., unless
$x = 1, 2, 3, 4, 5,$ or $6$.

(b) $F_X$ is continuous at $x$ unless $x \in \{1, 2, 3, 4, 5, 6\}$.

(c) Yes, $\lim_{n\to\infty} X_n = X$ d. by definition.

2.12 Convergence of a minimum 

(a) The sequence $(X_n)$ converges to zero in all four senses. Here is one proof, and there are others.
For any $\epsilon$ with $0 < \epsilon < 1$, $P\{|X_n - 0| \ge \epsilon\} = P\{U_1 \ge \epsilon, \ldots, U_n \ge \epsilon\} = (1 - \epsilon)^n$, which converges
to zero as $n \to \infty$. Thus, by definition, $X_n \to 0$ p. Thus, the sequence converges to zero in the d.
sense and, since it is bounded, in the m.s. sense. For each $\omega$, as a function of $n$, the sequence of
numbers $X_1(\omega), X_2(\omega), \ldots$ is a nonincreasing sequence of numbers bounded below by zero. Thus,
the sequence $X_n$ converges in the a.s. sense to some limit random variable. If a limit of random
variables exists in different senses, the limit random variable has to be the same, so the sequence
$(X_n)$ converges a.s. to zero.

[Figure 12.1: $F_X$]

(b) For $n$ fixed, the variable $Y_n$ is distributed over the interval $[0, n^\theta]$, so let $c$ be a number in that
interval. Then $P\{Y_n \le c\} = P\{X_n \le cn^{-\theta}\} = 1 - P\{X_n > cn^{-\theta}\} = 1 - (1 - cn^{-\theta})^n$. Thus, if $\theta = 1$,

$$\lim_{n\to\infty} P\{Y_n \le c\} = 1 - \lim_{n\to\infty}\left(1 - \frac{c}{n}\right)^n = 1 - \exp(-c)$$

for any $c \ge 0$. Therefore, if $\theta = 1$, the sequence $(Y_n)$ converges in distribution, and the limit
distribution is the exponential distribution with parameter one.
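The convergence in part (b) can be checked numerically. The following is a minimal sketch, assuming $\theta = 1$ so that $Y_n = n \min(U_1, \ldots, U_n)$; the function names are illustrative and not part of the notes.

```python
import math

def cdf_scaled_min(n: int, c: float) -> float:
    """P{n * min(U_1, ..., U_n) <= c} = 1 - (1 - c/n)^n, valid for 0 <= c <= n."""
    return 1.0 - (1.0 - c / n) ** n

def exp_cdf(c: float) -> float:
    """CDF of the exponential distribution with parameter one."""
    return 1.0 - math.exp(-c)

if __name__ == "__main__":
    c = 1.5
    for n in (10, 1000, 100000):
        # The two columns should approach each other as n grows.
        print(n, cdf_scaled_min(n, c), exp_cdf(c))
```

The gap between the two CDFs shrinks roughly like $1/n$, which is consistent with the limit $(1 - c/n)^n \to e^{-c}$ used above.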



2.14 Limits of functions of random variables 

(a) Yes. Since $g$ is a continuous function, if a sequence of numbers $a_n$ converges to a limit $a$, then
$g(a_n)$ converges to $g(a)$. Therefore, for any $\omega$ such that $\lim_{n\to\infty} X_n(\omega) = X(\omega)$, it holds that
$\lim_{n\to\infty} g(X_n(\omega)) = g(X(\omega))$. If $X_n \to X$ a.s., then the set of all such $\omega$ has probability one, so
$g(X_n) \to g(X)$ a.s.

(b) Yes. A direct proof is to first note that $|g(b) - g(a)| \le |b - a|$ for any numbers $a$ and $b$. So, if
$X_n \to X$ m.s., then $E[|g(X_n) - g(X)|^2] \le E[|X - X_n|^2] \to 0$ as $n \to \infty$. Therefore $g(X_n) \to g(X)$
m.s. (A slightly more general proof would be to use the continuity of $g$ (implying uniform continuity
on bounded intervals) to show that $g(X_n) \to g(X)$ p., and then, since $g$ is bounded, use the fact
that convergence in probability for a bounded sequence implies convergence in the m.s. sense.)

(c) No. For a counterexample, let $X_n = (-1)^n/n$. Then $X_n \to 0$ deterministically, and hence in
the a.s. sense. But $h(X_n) = (-1)^n$, which converges with probability zero, not with probability
one.

(d) No. For a counterexample, let $X_n = (-1)^n/n$. Then $X_n \to 0$ deterministically, and hence in
the m.s. sense. But $h(X_n) = (-1)^n$ does not converge in the m.s. sense. (For a proof, note that
$E[h(X_m)h(X_n)] = (-1)^{m+n}$, which does not converge as $m, n \to \infty$. Thus, $h(X_n)$ does not satisfy
the necessary Cauchy criteria for m.s. convergence.)



2.16 Sums of i.i.d. random variables, II 

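(a) $\Phi_{X_1}(u) = \frac{1}{2}e^{ju} + \frac{1}{2}e^{-ju} = \cos(u)$, so $\Phi_{S_n}(u) = \Phi_{X_1}(u)^n = (\cos(u))^n$, and
$\Phi_{V_n}(u) = \Phi_{S_n}(u/\sqrt{n}) = \cos(u/\sqrt{n})^n$.

(b)
$$\lim_{n\to\infty} \Phi_{S_n}(u) = \begin{cases} 1 & \text{if } u \text{ is an even multiple of } \pi \\ \text{does not exist} & \text{if } u \text{ is an odd multiple of } \pi \\ 0 & \text{if } u \text{ is not a multiple of } \pi. \end{cases}$$

$$\lim_{n\to\infty} \Phi_{V_n}(u) = \lim_{n\to\infty}\left(1 - \frac{1}{2}\left(\frac{u}{\sqrt{n}}\right)^2 + o\left(\frac{u^2}{n}\right)\right)^n = \lim_{n\to\infty}\left(1 - \frac{u^2}{2n} + o\left(\frac{u^2}{n}\right)\right)^n = e^{-\frac{u^2}{2}}$$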



(c) $S_n$ does not converge in distribution, because, for example, $\lim_{n\to\infty} \Phi_{S_n}(\pi) = \lim_{n\to\infty} (-1)^n$
does not exist. So $S_n$ does not converge in the m.s., a.s. or p. sense either. The limit of $\Phi_{V_n}$ is
the characteristic function of the $N(0, 1)$ distribution, so that $(V_n)$ converges in distribution and
the limit distribution is $N(0, 1)$. It will next be proved that $V_n$ does not converge in probability.
The intuitive idea is that if $m$ is much larger than $n$, then most of the random variables in the sum
defining $V_m$ are independent of the variables defining $V_n$. Hence, there is no reason for $V_m$ to be
close to $V_n$ with high probability. The proof below looks at the case $m = 2n$. Note that

$$V_{2n} - V_n = \frac{X_1 + \cdots + X_{2n}}{\sqrt{2n}} - \frac{X_1 + \cdots + X_n}{\sqrt{n}} = \frac{\sqrt{2} - 2}{2}\left\{\frac{X_1 + \cdots + X_n}{\sqrt{n}}\right\} + \frac{1}{\sqrt{2}}\left\{\frac{X_{n+1} + \cdots + X_{2n}}{\sqrt{n}}\right\}$$

The two terms within the two pairs of braces are independent, and by the central limit theorem,
each converges in distribution to the $N(0, 1)$ distribution. Thus $\lim_{n\to\infty} V_{2n} - V_n = W$ in the d. sense,
where $W$ is a normal random variable with mean 0 and
$\mathrm{Var}(W) = \left(\frac{\sqrt{2} - 2}{2}\right)^2 + \left(\frac{1}{\sqrt{2}}\right)^2 = 2 - \sqrt{2}$.
Thus, $\lim_{n\to\infty} P\{|V_{2n} - V_n| > \epsilon\} \ne 0$, so by the Cauchy criteria for convergence in probability, $V_n$
does not converge in probability. Hence $V_n$ does not converge in the a.s. sense or m.s. sense either.
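The two computations used in parts (b) and (c) can be sanity-checked numerically. This is a small sketch only; the function names are illustrative and not from the notes.

```python
import math

def phi_v(n: int, u: float) -> float:
    """Characteristic function of V_n = S_n / sqrt(n) for equiprobable +/-1 steps."""
    return math.cos(u / math.sqrt(n)) ** n

def var_of_limit_w() -> float:
    """Variance of the distributional limit of V_{2n} - V_n."""
    a = (math.sqrt(2) - 2) / 2   # weight on (X_1 + ... + X_n) / sqrt(n)
    b = 1 / math.sqrt(2)         # weight on (X_{n+1} + ... + X_{2n}) / sqrt(n)
    return a * a + b * b         # the two normalized sums are independent

if __name__ == "__main__":
    u = 1.3
    print(phi_v(10**6, u), math.exp(-u * u / 2))   # should nearly agree
    print(var_of_limit_w(), 2 - math.sqrt(2))      # should agree
```

Because the limiting variance $2 - \sqrt{2}$ is strictly positive, the difference $V_{2n} - V_n$ cannot shrink to zero in probability, which is the point of the Cauchy argument.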

2.18 On the growth of the maximum of n independent exponentials 

(a) Let $n \ge 2$. Clearly $F_{Z_n}(c) = 0$ for $c \le 0$. For $c > 0$,

$$F_{Z_n}(c) = P\{\max\{X_1, \ldots, X_n\} \le c \ln n\}$$
$$= P\{X_1 \le c \ln n, X_2 \le c \ln n, \ldots, X_n \le c \ln n\}$$
$$= P\{X_1 \le c \ln n\}P\{X_2 \le c \ln n\} \cdots P\{X_n \le c \ln n\}$$
$$= (1 - e^{-c \ln n})^n = (1 - n^{-c})^n$$

(b) Or, $F_{Z_n}(c) = \left(1 + \frac{x_n}{n}\right)^n$, where $x_n = -n^{1-c}$. Observe that as $n \to \infty$,

$$x_n \to \begin{cases} -\infty & c < 1 \\ -1 & c = 1 \\ 0 & c > 1, \end{cases}$$

so by Lemma 2.3.1 (and the monotonicity of the function $e^x$ to extend to the case $x = -\infty$),

$$F_{Z_n}(c) \to \begin{cases} 0 & c < 1 \\ e^{-1} & c = 1 \\ 1 & c > 1. \end{cases}$$

Therefore, if $Z_\infty$ is the random variable that is equal to one with probability one, then $F_{Z_n}(c) \to
F_{Z_\infty}(c)$ at the continuity points (i.e. at $c \ne 1$) of $F_{Z_\infty}$. So the sequence $(Z_n)$ converges to one in
distribution.
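The three limiting regimes can be observed by evaluating the CDF from part (a) for a large $n$. A minimal sketch, with an illustrative function name:

```python
def cdf_z(n: int, c: float) -> float:
    """F_{Z_n}(c) = (1 - n^{-c})^n for c > 0, the CDF of max(X_1..X_n) / ln(n)."""
    return (1.0 - n ** (-c)) ** n

if __name__ == "__main__":
    n = 10**6
    for c in (0.5, 1.0, 2.0):
        # Expect values near 0, exp(-1), and 1 respectively.
        print(c, cdf_z(n, c))
```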

2.20 Limit behavior of a stochastic dynamical system 

Due to the persistent noise, just as for the example following Theorem 2.1.5 in the notes, the
sequence does not converge to an ordinary random variable in the a.s., p., or m.s. senses. To gain
some insight, imagine (or simulate on a computer) a typical sample path of the process. A typical
sample sequence hovers around zero for a while, but eventually, since the Gaussian variables can
be arbitrarily large, some value of $X_n$ will cross above any fixed threshold with probability one.
After that, $X_n$ would probably converge to infinity quickly. For example, if $X_n = 3$ for some
$n$, and if the noise were ignored from that time forward, then $X$ would go through the sequence
$9, 81, 6561, 43046721, 1853020188851841, 3.43 \times 10^{30}, \ldots$, and one suspects the noise terms would
not stop the growth. This suggests that $X_n \to +\infty$ in the a.s. sense (and hence in the p. and d.
senses as well). (Convergence to $+\infty$ in the m.s. sense is not well defined.) Of course, then, $X_n$ does
not converge in any sense to an ordinary random variable.

We shall follow the above intuition to prove that $X_n \to \infty$ a.s. If $W_{n-1} \ge 3$ for some $n$, then
$X_n \ge 3$. Thus, the sequence $X_n$ will eventually cross above the threshold 3. We say that $X$ diverges
nicely from time $n$ if the event $E_n = \{X_{n+k} \ge 3 \cdot 2^k \text{ for all } k \ge 0\}$ is true. Note that if $X_{n+k} \ge 3 \cdot 2^k$
and $W_{n+k} \ge -3 \cdot 2^k$, then $X_{n+k+1} \ge (3 \cdot 2^k)^2 - 3 \cdot 2^k = 3 \cdot 2^k(3 \cdot 2^k - 1) \ge 3 \cdot 2^{k+1}$. Therefore,
$E_n \supset \{X_n \ge 3 \text{ and } W_{n+k} \ge -3 \cdot 2^k \text{ for all } k \ge 0\}$. Thus, using a union bound and the bound
$Q(u) \le \frac{1}{2}\exp(-u^2/2)$ for $u \ge 0$:

$$P(E_n \mid X_n \ge 3) \ge P\{W_{n+k} \ge -3 \cdot 2^k \text{ for all } k \ge 0\}$$
$$= 1 - P\left[\cup_{k=0}^\infty \{W_{n+k} < -3 \cdot 2^k\}\right]$$
$$\ge 1 - \sum_{k=0}^\infty P\{W_{n+k} < -3 \cdot 2^k\} = 1 - \sum_{k=0}^\infty Q(3 \cdot 2^k \cdot \sqrt{2})$$
$$\ge 1 - \frac{1}{2}\sum_{k=0}^\infty \exp(-(3 \cdot 2^k)^2) \ge 1 - \frac{1}{2}\sum_{k=0}^\infty (e^{-9})^{k+1} = 1 - \frac{e^{-9}}{2(1 - e^{-9})} \ge 0.9999.$$

The pieces are put together as follows. Let $N_1$ be the smallest time such that $X_{N_1} \ge 3$. Then $N_1$ is
finite with probability one, as explained above. Then $X$ diverges nicely from time $N_1$ with probability
at least 0.9999. However, if $X$ does not diverge nicely from time $N_1$, then there is some first
time of the form $N_1 + k$ such that $X_{N_1+k} < 3 \cdot 2^k$. Note that the future of the process beyond that
time has the same evolution as the original process. Let $N_2$ be the first time after that such that
$X_{N_2} \ge 3$. Then $X$ again has chance at least 0.9999 to diverge nicely to infinity. And so on. Thus,
$X$ will have arbitrarily many chances to diverge nicely to infinity, with each chance having probability
at least 0.9999. The number of chances needed until success is a.s. finite (in fact it has the
geometric distribution), so that $X$ diverges nicely to infinity from some time, with probability one.
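The numerical claims in this solution can be reproduced with a few lines. The sketch below assumes the noise terms $W_k$ have variance $1/2$, which is what the factor $\sqrt{2}$ inside $Q(\cdot)$ in the derivation suggests; the function names are illustrative.

```python
import math

def q(u: float) -> float:
    """Standard normal complementary CDF Q(u)."""
    return 0.5 * math.erfc(u / math.sqrt(2))

def nice_divergence_bound(terms: int = 40) -> float:
    """Union-bound value 1 - sum_k Q(3 * 2^k * sqrt(2)), truncated after `terms` terms."""
    return 1.0 - sum(q(3 * 2**k * math.sqrt(2)) for k in range(terms))

def closed_form_bound() -> float:
    """The looser geometric-series bound 1 - e^{-9} / (2 (1 - e^{-9}))."""
    return 1.0 - math.exp(-9) / (2 * (1 - math.exp(-9)))

def noise_free_path(x: int, steps: int) -> list:
    """Iterates of x -> x^2, i.e. the recursion with the noise ignored."""
    path = []
    for _ in range(steps):
        x = x * x
        path.append(x)
    return path

if __name__ == "__main__":
    print(nice_divergence_bound(), closed_form_bound())
    print(noise_free_path(3, 6))   # 9, 81, 6561, ... up to 3**64
```

The terms of the sum decay so quickly that the truncation at 40 terms is immaterial; the first term alone dominates.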



2.22 Convergence analysis of successive averaging 

(b) The means $\mu_n$ of $X_n$ for all $n$ are determined by the recursion $\mu_0 = 0$, $\mu_1 = 1$, and, for $n \ge 1$,
$\mu_{n+1} = (\mu_n + \mu_{n-1})/2$. This second order recursion has a solution of the form $\mu_n = A\theta_1^n + B\theta_2^n$,
where $\theta_1$ and $\theta_2$ are the solutions to the equation $\theta^2 = (1 + \theta)/2$. This yields $\mu_n = \frac{2}{3}\left(1 - \left(-\frac{1}{2}\right)^n\right)$.

(c) It is first proved that $\lim_{n\to\infty} D_n = 0$ a.s. Note that $D_n = U_1 \cdots U_{n-1}$. Since $\ln D_n =
\ln(U_1) + \cdots + \ln(U_{n-1})$ and $E[\ln U_i] = \int_0^1 \ln(u)\,du = (x \ln x - x)\big|_0^1 = -1$, the strong law of large
numbers implies that $\lim_{n\to\infty} \frac{\ln D_n}{n-1} = -1$ a.s., which in turn implies $\lim_{n\to\infty} \ln D_n = -\infty$ a.s., or
equivalently, $\lim_{n\to\infty} D_n = 0$ a.s., which was to be proved. By the hint, for each $\omega$ such that $D_n(\omega)$
converges to zero, the sequence $X_n(\omega)$ is a Cauchy sequence of numbers, and hence has a limit.
The set of such $\omega$ has probability one, so $X_n$ converges a.s.
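The closed form for the means in part (b) can be compared against the recursion using exact rational arithmetic. A minimal sketch with illustrative function names:

```python
from fractions import Fraction

def mean_by_recursion(n_max: int) -> list:
    """mu_0 = 0, mu_1 = 1, mu_{n+1} = (mu_n + mu_{n-1}) / 2; returns mu_0..mu_{n_max}."""
    mu = [Fraction(0), Fraction(1)]
    for n in range(1, n_max):
        mu.append((mu[n] + mu[n - 1]) / 2)
    return mu

def mean_closed_form(n: int) -> Fraction:
    """mu_n = (2/3) * (1 - (-1/2)^n), built from the roots 1 and -1/2 of theta^2 = (1 + theta)/2."""
    return Fraction(2, 3) * (1 - Fraction(-1, 2) ** n)

if __name__ == "__main__":
    for n, m in enumerate(mean_by_recursion(8)):
        print(n, m, mean_closed_form(n))
```

The means oscillate around and converge to $2/3$, since the root $-1/2$ contributes a term that decays geometrically.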



2.24 Mean square convergence of a random series 

Let $Y_n = X_1 + \cdots + X_n$. We are interested in determining whether $\lim_{n\to\infty} Y_n$ exists in the m.s.
sense. By Proposition 2.2.3, the m.s. limit exists if and only if the limit $\lim_{m,n\to\infty} E[Y_m Y_n]$ exists
and is finite. But $E[Y_m Y_n] = \sum_{k=1}^{n \wedge m} \sigma_k^2$, which converges to $\sum_{k=1}^\infty \sigma_k^2$ as $n, m \to \infty$. Thus, $(Y_n)$
converges in the m.s. sense if and only if $\sum_{k=1}^\infty \sigma_k^2 < \infty$.



2.26 A large deviation 

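Since the threshold 2 exceeds the mean $E[X^2] = 1$, Cramér's theorem implies that $b = \ell(2)$, which we
now compute. Note, for $a > 0$, $\int_{-\infty}^\infty e^{-ax^2}dx = \int_{-\infty}^\infty e^{-\frac{x^2}{2(1/(2a))}}dx = \sqrt{\frac{\pi}{a}}$. So

$$M(\theta) = \ln E[e^{\theta X^2}] = \ln \int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}} e^{-x^2\left(\frac{1}{2} - \theta\right)}dx = -\frac{1}{2}\ln(1 - 2\theta)$$

$$\ell(a) = \max_\theta \left\{\theta a + \frac{1}{2}\ln(1 - 2\theta)\right\}$$

$$\theta^* = \frac{1}{2}\left(1 - \frac{1}{a}\right)$$

$$b = \ell(2) = \frac{1}{2}(1 - \ln 2) = 0.1534$$

$$e^{-100b} = 2.18 \times 10^{-7}$$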
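The value of $\ell(2)$ can be confirmed by maximizing $\theta a - M(\theta)$ numerically. The sketch below assumes the log moment generating function $M(\theta) = -\frac{1}{2}\ln(1 - 2\theta)$ derived above (valid for $\theta < 1/2$) and uses a simple ternary search, which is one possible choice since the objective is concave; the names are illustrative.

```python
import math

def log_mgf(theta: float) -> float:
    """M(theta) = ln E[exp(theta * X^2)] for X ~ N(0, 1), valid for theta < 1/2."""
    return -0.5 * math.log(1.0 - 2.0 * theta)

def rate(a: float, iters: int = 200) -> float:
    """l(a) = max over theta < 1/2 of theta * a - M(theta), by ternary search."""
    lo, hi = -5.0, 0.5 - 1e-12
    objective = lambda t: t * a - log_mgf(t)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1      # the maximizer lies to the right of m1
        else:
            hi = m2      # the maximizer lies to the left of m2
    return objective((lo + hi) / 2.0)

if __name__ == "__main__":
    b = rate(2.0)
    print(b, 0.5 * (1 - math.log(2)))   # both about 0.1534
    print(math.exp(-100 * b))           # about 2.2e-7
```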



2.28 A rapprochement between the central limit theorem and large deviations 

(a) Differentiating with respect to $\theta$ yields $M'(\theta) = \left(\frac{dE[\exp(\theta X)]}{d\theta}\right) / E[\exp(\theta X)]$. Differentiating again
yields $M''(\theta) = \left(\frac{d^2 E[\exp(\theta X)]}{d\theta^2} E[\exp(\theta X)] - \left(\frac{dE[\exp(\theta X)]}{d\theta}\right)^2\right) / E[\exp(\theta X)]^2$. Interchanging
differentiation and expectation yields $\frac{d^k E[\exp(\theta X)]}{d\theta^k} = E[X^k \exp(\theta X)]$. Therefore,
$M'(\theta) = E[X \exp(\theta X)] / E[\exp(\theta X)]$, which is the mean for the tilted distribution $f_\theta$, and
$M''(\theta) = \left(E[X^2 \exp(\theta X)]E[\exp(\theta X)] - E[X \exp(\theta X)]^2\right) / E[\exp(\theta X)]^2$, which is the second
moment, minus the first moment squared, or simply the variance, for the tilted density $f_\theta$.

(b) In particular, $M'(0) = 0$ and $M''(0) = \mathrm{Var}(X) = \sigma^2$, so the second order Taylor's approximation
for $M$ near zero is $M(\theta) = \theta^2\sigma^2/2$. Therefore, $\ell(a)$ for small $a$ satisfies $\ell(a) \approx \max_\theta \left(a\theta - \frac{\theta^2\sigma^2}{2}\right) = \frac{a^2}{2\sigma^2}$,
so as $n \to \infty$, the large deviations upper bound behaves as $P\{S_n \ge b\sqrt{n}\} \le \exp(-n\ell(b/\sqrt{n})) \approx
\exp\left(-n\frac{b^2}{2\sigma^2 n}\right) = \exp\left(-\frac{b^2}{2\sigma^2}\right)$. The exponent is the same as in the bound/approximation to the central
limit approximation described in the problem statement. Thus, for moderately large $b$, the central
limit theorem approximation and large deviations bound/approximation are consistent with each
other.



2.30 Large deviations of a mixed sum 

Modifying the derivation for iid random variables, we find that for $\theta \ge 0$:

$$P\left\{\frac{S_n}{n} \ge a\right\} \le E[e^{\theta(S_n - an)}]$$
$$= E[e^{\theta X_1}]^{nf} E[e^{\theta Y_1}]^{n(1-f)} e^{-n\theta a}$$
$$= \exp(-n[\theta a - fM_X(\theta) - (1 - f)M_Y(\theta)])$$

where $M_X$ and $M_Y$ are the log moment generating functions of $X_1$ and $Y_1$ respectively. Therefore,

$$l(f, a) = \max_\theta \; \theta a - fM_X(\theta) - (1 - f)M_Y(\theta)$$

where

$$M_X(\theta) = \begin{cases} -\ln(1 - \theta) & \theta < 1 \\ +\infty & \theta \ge 1 \end{cases} \qquad M_Y(\theta) = \ln \sum_{k=0}^\infty \frac{e^{\theta k}e^{-1}}{k!} = \ln(e^{e^\theta - 1}) = e^\theta - 1.$$

Note that $l(0, a) = a \ln a + 1 - a$ (large deviations exponent for the $Poi(1)$ distribution) and
$l(1, a) = a - 1 - \ln(a)$ (large deviations exponent for the $Exp(1)$ distribution). For $0 < f < 1$ we
compute $l(f, 4)$ by numerical optimization. The result is

    f         0       0+      1/3     2/3     1
    l(f,4)    2.545   2.282   1.876   1.719   1.614

Note: $l(f, 4)$ is discontinuous in $f$ at $f = 0$. In fact, adding only one exponentially distributed
random variable to a sum of Poisson random variables can change the large deviations exponent.
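The tabulated exponents can be reproduced with a short numerical optimization. The sketch below is one way to do it, not necessarily the method used for the notes: it assumes the two log moment generating functions stated above (exponential with parameter one, weighted by $f$, and Poisson with mean one, weighted by $1 - f$) and uses a ternary search, which applies because the objective is concave in $\theta$. The function name `exponent` is illustrative.

```python
import math

def exponent(f: float, a: float, iters: int = 300) -> float:
    """Maximize theta*a - f*M_X(theta) - (1-f)*M_Y(theta) over theta.

    M_X(theta) = -ln(1 - theta) for theta < 1, M_Y(theta) = exp(theta) - 1.
    """
    def objective(theta: float) -> float:
        value = theta * a - (1.0 - f) * (math.exp(theta) - 1.0)
        if f > 0:
            value += f * math.log(1.0 - theta)   # equals -f * M_X(theta)
        return value

    lo = -10.0
    hi = 1.0 - 1e-12 if f > 0 else 10.0           # M_X is infinite for theta >= 1
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective((lo + hi) / 2.0)

if __name__ == "__main__":
    for f in (0.0, 1e-9, 1 / 3, 2 / 3, 1.0):
        print(f, round(exponent(f, 4.0), 3))
```

Evaluating at a tiny positive $f$ (here $10^{-9}$) approximates the $0+$ column: the constraint $\theta < 1$ becomes active as soon as any exponential term is present, which is the source of the discontinuity at $f = 0$.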

2.32 The limit of a sum of cumulative products of a sequence of uniform random vari- 
ables 

(a) Yes. E[(B k - 0) 2 ] = E[A\] k = (|) fc -> as k -> oo. Thus, B k "^t 0. 

(b) Yes. Each sample path of the sequence $B_k$ is monotone nonincreasing and bounded below by
zero, and is hence convergent. Thus, $\lim_{k\to\infty} B_k$ exists a.s. (The limit has to be the same as the
m.s. limit, so $B_k$ converges to zero almost surely.)

(c) If j < k, then E[BjB k ] = E[A\ ■ ■ ■ A?A j+1 ■ ■ ■ A k ] = (§) J "(| ) k ~ j ■ Therefore, 

n m n m oo oo 

E[S n S m ] = E[J2B j J2 B k]=J2J2 E i B J B k]^J2J2 E i B ^] ( 12 - 3 ) 

3=1 fe=l j=l k=\ 3=1 k=l 

= »EEfs) (i) +E /5 

3=1 k=j+l 




(12.4) 



3 V ; 3 



A visual way to derive (12.4), is to note that (12.3) is the sum of all entries in the infinite 2-d array: 



(l)(!) 2 (I) 2 (!) (I) 8 ••• 

(f)(1) (I) 2 (I) (f) ••• 

(I) (§)(!) (I)(!) 2 ••• 

Therefore, (|) I 2 Y^^ 1 (|) + 1 J is readily seen to be the sum of the j term on the diagonal, 

plus all terms directly above or directly to the right of that term. 

(d) Mean square convergence implies convergence of the mean. Thus, the mean of the limit is 
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lim n ^ 00 E[S n ] = linin^oo Ylk=i E[Bk] = 2fcLi(f) = 3- The second moment of the limit is the 
limit of the second moments, namely ^, so the variance of the limit is ^ — 3 2 = I. 
(e) Yes. Each sample path of the sequence S n is monotone nondecreasing and is hence convergent. 
Thus, linin^oo S n a.s. exists. The limit has to be the same as the m.s. limit. (To show that the limit 
is finite with probability one, and hence an ordinary random variable, the monotone convergence 
theorem could be applied, yielding Pflinin^oo Soo] = limn^oo E[S n ] = 1. ) 



3.2 Linear approximation of the cosine function over an interval 

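$\hat{E}[Y|\Theta] = E[Y] + \frac{\mathrm{Cov}(\Theta, Y)}{\mathrm{Var}(\Theta)}(\Theta - E[\Theta])$, where $E[Y] = \frac{1}{\pi}\int_0^\pi \cos(\theta)\,d\theta = 0$, $E[\Theta] = \frac{\pi}{2}$, $\mathrm{Var}(\Theta) = \frac{\pi^2}{12}$,
$E[\Theta Y] = \int_0^\pi \frac{\theta\cos(\theta)}{\pi}d\theta = \frac{\theta\sin(\theta)}{\pi}\Big|_0^\pi - \int_0^\pi \frac{\sin(\theta)}{\pi}d\theta = -\frac{2}{\pi}$, and $\mathrm{Cov}(\Theta, Y) = E[\Theta Y] - E[\Theta]E[Y] = -\frac{2}{\pi}$.
Therefore, $\hat{E}[Y|\Theta] = -\frac{24}{\pi^3}\left(\Theta - \frac{\pi}{2}\right)$, so the optimal choice is $a = \frac{12}{\pi^2}$ and $b = -\frac{24}{\pi^3}$.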
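The coefficients can be checked by computing the moments numerically. The sketch below assumes, as the integrals in the solution indicate, that $\Theta$ is uniform on $[0, \pi]$ and that the approximation has the form $a + b\Theta$; the helper names and the use of Simpson's rule are illustrative choices.

```python
import math

def simpson(g, lo: float, hi: float, n: int = 2000) -> float:
    """Composite Simpson's rule with n (even) subintervals."""
    h = (hi - lo) / n
    total = g(lo) + g(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(lo + i * h)
    return total * h / 3.0

def best_linear_fit_of_cos():
    """Return (intercept, slope) of the LMMSE estimate of cos(Theta), Theta ~ Uniform[0, pi]."""
    density = 1.0 / math.pi
    mean_t = simpson(lambda t: t * density, 0.0, math.pi)
    mean_y = simpson(lambda t: math.cos(t) * density, 0.0, math.pi)
    var_t = simpson(lambda t: (t - mean_t) ** 2 * density, 0.0, math.pi)
    cov = simpson(lambda t: (t - mean_t) * (math.cos(t) - mean_y) * density, 0.0, math.pi)
    slope = cov / var_t
    return mean_y - slope * mean_t, slope

if __name__ == "__main__":
    a, b = best_linear_fit_of_cos()
    print(a, 12 / math.pi**2)
    print(b, -24 / math.pi**3)
```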



3.4 Valid covariance matrix 

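Set $a = 1$ to make $K$ symmetric. Choose $b$ so that the determinants of the following seven matrices
are nonnegative:

$$(2) \quad (1) \quad (1) \quad \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} \quad \begin{pmatrix} 2 & b \\ b & 1 \end{pmatrix} \quad \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \quad K$$

The fifth matrix has determinant $2 - b^2$ and $\det(K) = 2 - 1 - b^2 = 1 - b^2$. Hence $K$ is a valid
covariance matrix (i.e. symmetric and positive semidefinite) if and only if $a = 1$ and $-1 \le b \le 1$.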

3.6 Conditional probabilities with joint Gaussians II 

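(a) $P\{|X - 1| \ge 2\} = P\{X \le -1 \text{ or } X \ge 3\} = P\{\frac{X}{2} \le -\frac{1}{2}\} + P\{\frac{X}{2} \ge \frac{3}{2}\} = \Phi(-\frac{1}{2}) + 1 - \Phi(\frac{3}{2})$.

(b) Given $Y = 3$, the conditional density of $X$ is Gaussian with mean $E[X] + \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(Y)}(3 - E[Y]) = 1$
and variance $\mathrm{Var}(X) - \frac{\mathrm{Cov}(X,Y)^2}{\mathrm{Var}(Y)} = 4 - \frac{\mathrm{Cov}(X,Y)^2}{\mathrm{Var}(Y)} = 2$.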

(c) The estimation error X — E[X\Y] is Gaussian, has mean zero and variance 2, and is indepen- 
dent of Y. (The variance of the error was calculated to be 2 in part (b)). Thus the probability is 
^( - 72 ) + 1 ~ ^(79)' wnicn can also be written as 2$ (—4=) or 2(1 - $(4=))- 

3.8 An MMSE estimation problem 

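(a) $E[XY] = 2\int_0^1 \int_{2x}^{1+x} xy\,dy\,dx = \frac{5}{12}$. The other moments can be found in a similar way.
Alternatively, note that the marginal densities are given by

$$f_X(x) = \begin{cases} 2(1 - x) & 0 \le x \le 1 \\ 0 & \text{else} \end{cases} \qquad f_Y(y) = \begin{cases} y & 0 \le y \le 1 \\ 2 - y & 1 \le y \le 2 \\ 0 & \text{else} \end{cases}$$

so that $EX = \frac{1}{3}$, $\mathrm{Var}(X) = \frac{1}{18}$, $EY = 1$, $\mathrm{Var}(Y) = \frac{1}{6}$, $\mathrm{Cov}(X, Y) = \frac{5}{12} - \frac{1}{3} = \frac{1}{12}$. So

$$\hat{E}[X|Y] = \frac{1}{3} + \frac{1}{12}\left(\frac{1}{6}\right)^{-1}(Y - 1) = \frac{1}{3} + \frac{Y - 1}{2}$$

$$E[e^2] = \frac{1}{18} - \left(\frac{1}{12}\right)\left(\frac{1}{6}\right)^{-1}\left(\frac{1}{12}\right) = \frac{1}{72} = \text{the MMSE for } \hat{E}[X|Y]$$

Inspection of Figure 12.2 shows that for $0 \le y \le 2$, the conditional distribution of $X$ given $Y = y$ is
the uniform distribution over the interval $[0, y/2]$ if $0 \le y \le 1$ and over the interval $[y - 1, y/2]$
if $1 \le y \le 2$. The conditional mean of $X$ given $Y = y$ is thus the midpoint of that interval, yielding:

$$E[X|Y] = \begin{cases} \frac{Y}{4} & 0 \le Y \le 1 \\ \frac{3Y - 2}{4} & 1 \le Y \le 2 \end{cases}$$

Figure 12.2: Sketch of $E[X|Y = y]$ and $\hat{E}[X|Y = y]$.

To find the corresponding MSE, note that given $Y$, the conditional distribution of $X$ is uniform
over some interval. Let $L(Y)$ denote the length of the interval. Then

$$E[e^2] = E[E[e^2|Y]] = E\left[\frac{1}{12}L(Y)^2\right] = 2 \cdot \frac{1}{12}\int_0^1 y\left(\frac{y}{2}\right)^2 dy = \frac{1}{96}.$$

For this example, the MSE for the best estimator is 25% smaller than the MSE for the best linear
estimator.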

(b) $EX = \int_{-\infty}^\infty |y| \frac{1}{\sqrt{2\pi}} e^{-y^2/2}\,dy = 2\int_0^\infty \frac{y}{\sqrt{2\pi}} e^{-y^2/2}\,dy = \sqrt{\frac{2}{\pi}}$ and $EY = 0$,
$\mathrm{Var}(Y) = 1$, $\mathrm{Cov}(X, Y) = E[|Y|Y] = 0$ so $\hat{E}[X|Y] = \sqrt{\frac{2}{\pi}} + \frac{0}{1}Y \equiv \sqrt{\frac{2}{\pi}}$.
That is, the best linear estimator is the constant $EX$. The corresponding MSE is $\mathrm{Var}(X) =
E[X^2] - (EX)^2 = E[Y^2] - \frac{2}{\pi} = 1 - \frac{2}{\pi}$. Note that $|Y|$ is a function of $Y$ with mean square error
$E[(X - |Y|)^2] = 0$. Nothing can beat that, so $|Y|$ is the MMSE estimator of $X$ given $Y$. So
$|Y| = E[X|Y]$. The corresponding MSE is 0, or 100% smaller than the MSE for the best linear
estimator.

3.10 Conditional Gaussian comparison 

(a) $p_a = P\{X \ge 2\} = P\{\frac{X}{\sqrt{10}} \ge \frac{2}{\sqrt{10}}\} = Q(\frac{2}{\sqrt{10}}) = Q(0.6324)$.

(b) By the theory of conditional distributions for jointly Gaussian random variables, the conditional
distribution of $X$ given $Y = y$ is Gaussian, with mean $\hat{E}[X|Y = y]$ and variance $\sigma_e^2$, which is the
MSE for estimation of $X$ by $\hat{E}[X|Y]$. Since $X$ and $Y$ are mean zero and $\frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(Y)} = 0.8$, we have
$\hat{E}[X|Y = y] = 0.8y$, and $\sigma_e^2 = \mathrm{Var}(X) - \frac{\mathrm{Cov}(X,Y)^2}{\mathrm{Var}(Y)} = 3.6$. Hence, given $Y = y$, the conditional
distribution of $X$ is $N(0.8y, 3.6)$. Therefore, $P(X \ge 2|Y = y) = Q\left(\frac{2 - 0.8y}{\sqrt{3.6}}\right)$. In particular, $p_b =
P(X \ge 2|Y = 3) = Q\left(\frac{2 - (0.8)(3)}{\sqrt{3.6}}\right) = Q(-0.2108)$.

(c) Given the event $\{Y \ge 3\}$, the conditional pdf of $Y$ is obtained by setting the pdf of $Y$ to zero on
the interval $(-\infty, 3)$, and then renormalizing by $P\{Y \ge 3\} = Q(\frac{3}{\sqrt{10}})$ to make the density integrate
to one. We can write this as

$$f_{Y|Y \ge 3}(y) = \begin{cases} \frac{f_Y(y)}{Q(3/\sqrt{10})} & y \ge 3 \\ 0 & \text{else.} \end{cases}$$

Using this density, by considering the possible values of $Y$, we have

$$p_c = P(X \ge 2|Y \ge 3) = \int_3^\infty P(X \ge 2, Y \in dy|Y \ge 3) = \int_3^\infty P(X \ge 2|Y = y)P(Y \in dy|Y \ge 3)$$
$$= \int_3^\infty Q\left(\frac{2 - 0.8y}{\sqrt{3.6}}\right) f_{Y|Y \ge 3}(y)\,dy$$

(ALTERNATIVE) The same expression can be derived in a more conventional fashion as follows:

$$p_c = P(X \ge 2|Y \ge 3) = \frac{P\{X \ge 2, Y \ge 3\}}{P\{Y \ge 3\}}$$
$$= \int_3^\infty \left[\int_2^\infty f_{X|Y}(x|y)\,dx\right] f_Y(y)\,dy \Big/ P\{Y \ge 3\}$$
$$= \int_3^\infty Q\left(\frac{2 - 0.8y}{\sqrt{3.6}}\right) f_Y(y)\,dy \Big/ (1 - F_Y(3)) = \int_3^\infty Q\left(\frac{2 - 0.8y}{\sqrt{3.6}}\right) f_{Y|Y \ge 3}(y)\,dy$$

(d) We will show that $p_a < p_b < p_c$. The inequality $p_a < p_b$ follows from parts (a) and (b) and
the fact the function $Q$ is decreasing. By part (c), $p_c$ is an average of $Q\left(\frac{2 - 0.8y}{\sqrt{3.6}}\right)$ with respect to $y$
over the region $y \in [3, \infty)$ (using the pdf $f_{Y|Y \ge 3}$). But everywhere in that region, $Q\left(\frac{2 - 0.8y}{\sqrt{3.6}}\right) \ge p_b$,
showing that $p_c \ge p_b$.
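The ordering $p_a < p_b < p_c$ can be confirmed numerically. The sketch below assumes $\mathrm{Var}(X) = \mathrm{Var}(Y) = 10$, which is what the normalizations by $\sqrt{10}$ in parts (a) and (c) indicate, and evaluates the integral for $p_c$ with Simpson's rule on a truncated range; the function names are illustrative.

```python
import math

def q(u: float) -> float:
    """Standard normal complementary CDF Q(u)."""
    return 0.5 * math.erfc(u / math.sqrt(2))

def cond_tail(y: float) -> float:
    """P(X >= 2 | Y = y) when X given Y = y is N(0.8 y, 3.6)."""
    return q((2.0 - 0.8 * y) / math.sqrt(3.6))

def p_a() -> float:
    return q(2.0 / math.sqrt(10.0))

def p_b() -> float:
    return cond_tail(3.0)

def p_c(upper: float = 60.0, n: int = 20000) -> float:
    """Average of cond_tail over y >= 3 under the N(0, 10) density, renormalized."""
    def integrand(y: float) -> float:
        pdf = math.exp(-y * y / 20.0) / math.sqrt(20.0 * math.pi)
        return cond_tail(y) * pdf
    h = (upper - 3.0) / n
    total = integrand(3.0) + integrand(upper)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * integrand(3.0 + i * h)
    return (total * h / 3.0) / q(3.0 / math.sqrt(10.0))

if __name__ == "__main__":
    print(p_a(), p_b(), p_c())
```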

3.12 An estimator of an estimator 

To show that $\hat{E}[X|Y]$ is the LMMSE estimator of $E[X|Y]$, it suffices by the orthogonality principle
to note that $\hat{E}[X|Y]$ is linear in $(1, Y)$ and to prove that $E[X|Y] - \hat{E}[X|Y]$ is orthogonal to 1
and to $Y$. However $E[X|Y] - \hat{E}[X|Y]$ can be written as the difference of two random variables
$(X - \hat{E}[X|Y])$ and $(X - E[X|Y])$, which are each orthogonal to 1 and to $Y$. Thus, $E[X|Y] - \hat{E}[X|Y]$
is also orthogonal to 1 and to $Y$, and the result follows.

Here is a generalization, which can be proved in the same way. Suppose $V_0$ and $V_1$ are two
closed linear subspaces of random variables with finite second moments, such that $V_0 \supset V_1$. Let
$X$ be a random variable with finite second moment, and let $X_i^*$ be the variable in $V_i$ with the
minimum mean square distance to $X$, for $i = 0$ or $i = 1$. Then $X_1^*$ is the variable in $V_1$ with the
minimum mean square distance to $X_0^*$.

Another solution to the original problem can be obtained by using the formula for $\hat{E}[Z|Y]$
applied to $Z = E[X|Y]$:

$$\hat{E}[E[X|Y]|Y] = E[E[X|Y]] + \mathrm{Cov}(Y, E[X|Y])\mathrm{Var}(Y)^{-1}(Y - EY)$$

which can be simplified using $E[E[X|Y]] = EX$ and

$$\mathrm{Cov}(Y, E[X|Y]) = E[Y(E[X|Y] - EX)]$$
$$= E[YE[X|Y]] - EY \cdot EX$$
$$= E[E[XY|Y]] - EY \cdot EX$$
$$= E[XY] - EX \cdot EY = \mathrm{Cov}(X, Y)$$

to yield the desired result.

3.14 Some identities for estimators 

(a) True. The random variable $E[X|Y]\cos(Y)$ has the following two properties:

• It is a function of $Y$ with finite second moments (because $E[X|Y]$ is a function of $Y$ with
finite second moment and $\cos(Y)$ is a bounded function of $Y$)

• $(X\cos(Y) - E[X|Y]\cos(Y)) \perp g(Y)$ for any $g$ with $E[g(Y)^2] < \infty$ (because for any such $g$,
$E[(X\cos(Y) - E[X|Y]\cos(Y))g(Y)] = E[(X - E[X|Y])\tilde{g}(Y)] = 0$, where $\tilde{g}(Y) = g(Y)\cos(Y)$.)

Thus, by the orthogonality principle, $E[X|Y]\cos(Y)$ is equal to $E[X\cos(Y)|Y]$.

(b) True. The left hand side is the projection of $X$ onto the space $\{g(Y) : E[g(Y)^2] < \infty\}$ and the
right hand side is the projection of $X$ onto the space $\{f(Y^3) : E[f(Y^3)^2] < \infty\}$. But these two
spaces are the same, because for each function $g$ there is the function $f(u) = g(u^{1/3})$. The point is
that the function $y^3$ is an invertible function, so any function of $Y$ can also be written as a function
of $Y^3$.

(c) False. For example, let $X$ be uniform on the interval $[0, 1]$ and let $Y$ be identically zero. Then
$E[X^3|Y] = E[X^3] = \frac{1}{4}$ and $E[X|Y]^3 = E[X]^3 = \frac{1}{8}$.

(d) False. For example, let $P\{X = Y = 1\} = P\{X = Y = -1\} = 0.5$. Then $E[X|Y] = Y$ while
$E[X|Y^2] = 0$. The point is that the function $y^2$ is not invertible, so that not every function of $Y$
can be written as a function of $Y^2$. Equivalently, $Y^2$ can give less information than $Y$.

(e) False. For example, let $X$ be uniformly distributed on $[-1, 1]$, and let $Y = X$. Then $E[X|Y] = Y$
while $\hat{E}[X|Y^3] = E[X] + \frac{\mathrm{Cov}(X, Y^3)}{\mathrm{Var}(Y^3)}(Y^3 - E[Y^3]) = \frac{E[X^4]}{E[X^6]}Y^3 = \frac{7}{5}Y^3$.

3.16 Some simple examples 

Of course there are many valid answers for this problem-we only give one. 

(a) Let $X$ denote the outcome of a roll of a fair die, and let $Y = 1$ if $X$ is odd and $Y = 2$ if $X$ is
even. Then $E[X|Y]$ has to be linear. In fact, since $Y$ has only two possible values, any function
of $Y$ can be written in the form $a + bY$. That is, any function of $Y$ is linear. (There is no need to
even calculate $E[X|Y]$ here, but we note that it is given by $E[X|Y] = Y + 2$.)

(b) Let $X$ be a $N(0, 1)$ random variable, and let $W$ be independent of $X$, with $P\{W = 1\} = P\{W =
-1\} = \frac{1}{2}$. Finally, let $Y = XW$. The conditional distribution of $Y$ given $W$ is $N(0, 1)$, for either
possible value of $W$, so the unconditional distribution of $Y$ is also $N(0, 1)$. However, $P\{X - Y = 0\} = 0.5$,
so that $X - Y$ is not a Gaussian random variable, so $X$ and $Y$ are not jointly Gaussian.

(c) Let the triplet $(X, Y, Z)$ take on the four values $(0, 0, 0), (1, 1, 0), (1, 0, 1), (0, 1, 1)$ with equal
probability. Then any pair of these variables takes the values $(0, 0), (0, 1), (1, 0), (1, 1)$ with equal
probability, indicating pairwise independence. But
$P\{(X, Y, Z) = (0, 0, 1)\} = 0 \ne P\{X = 0\}P\{Y = 0\}P\{Z = 1\} = \frac{1}{8}$. So the three random variables
are not independent.



3.18 Estimating a quadratic 

(a) Recall the fact that $E[Z^2] = E[Z]^2 + \mathrm{Var}(Z)$ for any second order random variable $Z$. The
idea is to apply the fact to the conditional distribution of $X$ given $Y$. Given $Y$, the conditional
distribution of $X$ is Gaussian with mean $\rho Y$ and variance $1 - \rho^2$. Thus, $E[X^2|Y] = (\rho Y)^2 + 1 - \rho^2$.

(b) The MSE $= E[(X^2)^2] - E[(E[X^2|Y])^2] = E[X^4] - \rho^4 E[Y^4] - 2\rho^2 E[Y^2](1 - \rho^2) - (1 - \rho^2)^2 = 2(1 - \rho^4)$.

(c) Since $\mathrm{Cov}(X^2, Y) = E[X^2 Y] = 0$, it follows that $\hat{E}[X^2|Y] = E[X^2] = 1$. That is, the best linear
estimator in this case is just the constant estimator equal to 1.



3.20 An innovations sequence and its application 

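(a) $\tilde{Y}_1 = Y_1$. (Note: $E[\tilde{Y}_1^2] = 1$), $\tilde{Y}_2 = Y_2 - \frac{E[Y_2\tilde{Y}_1]}{E[\tilde{Y}_1^2]}\tilde{Y}_1 = Y_2 - 0.5Y_1$ (Note: $E[\tilde{Y}_2^2] = 0.75$.)
$\tilde{Y}_3 = Y_3 - \frac{E[Y_3\tilde{Y}_1]}{E[\tilde{Y}_1^2]}\tilde{Y}_1 - \frac{E[Y_3\tilde{Y}_2]}{E[\tilde{Y}_2^2]}\tilde{Y}_2 = Y_3 - (0.5)\tilde{Y}_1 - \frac{1}{3}\tilde{Y}_2 = Y_3 - \frac{1}{3}Y_1 - \frac{1}{3}Y_2$. Summarizing,

$$\begin{pmatrix} \tilde{Y}_1 \\ \tilde{Y}_2 \\ \tilde{Y}_3 \end{pmatrix} = A \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix} \quad \text{where } A = \begin{pmatrix} 1 & 0 & 0 \\ -\frac{1}{2} & 1 & 0 \\ -\frac{1}{3} & -\frac{1}{3} & 1 \end{pmatrix}.$$

Since $\mathrm{Cov}\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix} = \begin{pmatrix} 1 & 0.5 & 0.5 \\ 0.5 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{pmatrix}$ and $\mathrm{Cov}\left(X, \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix}\right) = (0 \;\; 0.25 \;\; 0.25)$,
it follows that $\mathrm{Cov}\begin{pmatrix} \tilde{Y}_1 \\ \tilde{Y}_2 \\ \tilde{Y}_3 \end{pmatrix} = A \begin{pmatrix} 1 & 0.5 & 0.5 \\ 0.5 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{pmatrix} A^T$
and that $\mathrm{Cov}\left(X, \begin{pmatrix} \tilde{Y}_1 \\ \tilde{Y}_2 \\ \tilde{Y}_3 \end{pmatrix}\right) = (0 \;\; 0.25 \;\; 0.25)A^T = (0 \;\; \frac{1}{4} \;\; \frac{1}{6})$.

(c) $a = \frac{\mathrm{Cov}(X, \tilde{Y}_1)}{E[\tilde{Y}_1^2]} = 0$, $b = \frac{\mathrm{Cov}(X, \tilde{Y}_2)}{E[\tilde{Y}_2^2]} = \frac{1}{3}$, $c = \frac{\mathrm{Cov}(X, \tilde{Y}_3)}{E[\tilde{Y}_3^2]} = \frac{1}{4}$.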
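The innovations computation is a Gram-Schmidt procedure, and it can be replayed with exact fractions. The sketch below assumes the covariance data used in the solution (unit variances and pairwise covariance 0.5 for the $Y_i$, and $\mathrm{Cov}(X, Y) = (0, 0.25, 0.25)$); the function and constant names are illustrative.

```python
from fractions import Fraction as F

COV_Y = [[F(1), F(1, 2), F(1, 2)],
         [F(1, 2), F(1), F(1, 2)],
         [F(1, 2), F(1, 2), F(1)]]
COV_XY = [F(0), F(1, 4), F(1, 4)]

def innovations_matrix(cov):
    """Rows of A such that (innovations) = A (Y_1, ..., Y_n)^T, via Gram-Schmidt."""
    n = len(cov)
    rows = []
    for i in range(n):
        row = [F(int(i == j)) for j in range(n)]
        for prev in rows:
            # E[Y_i * previous innovation] / E[(previous innovation)^2]
            num = sum(prev[j] * cov[i][j] for j in range(n))
            den = sum(prev[j] * cov[j][k] * prev[k] for j in range(n) for k in range(n))
            coef = num / den
            row = [row[j] - coef * prev[j] for j in range(n)]
        rows.append(row)
    return rows

def estimator_coefficients():
    """Cov(X, innovation_i) / Var(innovation_i) for each i."""
    coeffs = []
    for row in innovations_matrix(COV_Y):
        cov_x = sum(row[j] * COV_XY[j] for j in range(3))
        var = sum(row[j] * COV_Y[j][k] * row[k] for j in range(3) for k in range(3))
        coeffs.append(cov_x / var)
    return coeffs

if __name__ == "__main__":
    print(innovations_matrix(COV_Y))
    print(estimator_coefficients())
```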

3.22 A Kalman filtering example 

(a)
$$\hat{x}_{k+1|k} = f\hat{x}_{k|k-1} + K_k(y_k - \hat{x}_{k|k-1})$$

$$\sigma_{k+1}^2 = f^2\left(\sigma_k^2 - \sigma_k^2(\sigma_k^2 + 1)^{-1}\sigma_k^2\right) + 1 = \frac{\sigma_k^2 f^2}{1 + \sigma_k^2} + 1$$

and $K_k = f\left(\frac{\sigma_k^2}{1 + \sigma_k^2}\right)$.

(b) Since $\sigma_k^2 \le 1 + f^2$ for $k \ge 1$, the sequence $(\sigma_k^2)$ is bounded for any value of $f$.
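The boundedness claim in part (b) is easy to observe by iterating the variance recursion. A minimal sketch, assuming the scalar recursion $\sigma_{k+1}^2 = f^2\sigma_k^2/(1 + \sigma_k^2) + 1$ from part (a); the function name and the test values of $f$ are illustrative.

```python
def riccati_path(f: float, sigma2_0: float, steps: int) -> list:
    """Iterate sigma_{k+1}^2 = f^2 * sigma_k^2 / (1 + sigma_k^2) + 1 starting from sigma2_0."""
    path = [sigma2_0]
    for _ in range(steps):
        s = path[-1]
        path.append(f * f * s / (1.0 + s) + 1.0)
    return path

if __name__ == "__main__":
    for f in (0.5, 1.0, 3.0, 10.0):
        path = riccati_path(f, sigma2_0=100.0, steps=30)
        # After the first step every value lies in [1, 1 + f^2].
        print(f, path[1], path[-1], 1 + f * f)
```

The bound holds because $\sigma_k^2/(1 + \sigma_k^2) < 1$ regardless of how large the initial error variance is, so even an unstable state ($|f| > 1$) has bounded prediction error.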

3.24 A variation of Kalman filtering 

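Equations (3.25) and (3.24) are still true: $\hat{x}_{k+1|k} = f\hat{x}_{k|k-1} + K_k(y_k - \hat{x}_{k|k-1})$ and
$K_k = \mathrm{Cov}(x_{k+1} - f\hat{x}_{k|k-1}, \tilde{y}_k)\mathrm{Cov}(\tilde{y}_k)^{-1}$. The variable $v_k$ in (3.27) is replaced by $w_k$ to yield:

$$\mathrm{Cov}(x_{k+1} - f\hat{x}_{k|k-1}, \tilde{y}_k) = \mathrm{Cov}(f(x_k - \hat{x}_{k|k-1}) + w_k, \; x_k - \hat{x}_{k|k-1} + w_k) = f\sigma_k^2 + 1$$

As before, writing $\sigma_k^2$ for $\Sigma_{k|k-1}$,

$$\mathrm{Cov}(\tilde{y}_k) = \mathrm{Cov}((x_k - \hat{x}_{k|k-1}) + w_k) = \sigma_k^2 + 1$$

So now

$$K_k = \frac{f\sigma_k^2 + 1}{1 + \sigma_k^2} \quad \text{and} \quad \sigma_{k+1}^2 = (f^2\sigma_k^2 + 1) - \frac{(f\sigma_k^2 + 1)^2}{1 + \sigma_k^2} = \frac{(1 - f)^2\sigma_k^2}{1 + \sigma_k^2}$$

Note that if $f = 1$ then $\sigma_k^2 = 0$ for all $k \ge 1$, which makes sense because $x_k = y_{k-1}$ in that case.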

3.26 An innovations problem 

(a) $E[Y_n] = E[U_1 \cdots U_n] = E[U_1] \cdots E[U_n] = 2^{-n}$ and $E[Y_n^2] = E[U_1^2 \cdots U_n^2] = E[U_1^2] \cdots E[U_n^2] =
3^{-n}$, so $\mathrm{Var}(Y_n) = 3^{-n} - (2^{-n})^2 = 3^{-n} - 4^{-n}$.

(b) $E[Y_n|Y_0, \ldots, Y_{n-1}] = E[Y_{n-1}U_n|Y_0, \ldots, Y_{n-1}] = Y_{n-1}E[U_n|Y_0, \ldots, Y_{n-1}] = Y_{n-1}E[U_n] = Y_{n-1}/2$.

(c) Since the conditional expectation found in (b) is linear, it follows that $\hat{E}[Y_n|Y_0, \ldots, Y_{n-1}] =
E[Y_n|Y_0, \ldots, Y_{n-1}] = Y_{n-1}/2$.

(d) $\tilde{Y}_0 = Y_0 = 1$, and $\tilde{Y}_n = Y_n - Y_{n-1}/2$ (also equal to $U_1 \cdots U_{n-1}(U_n - \frac{1}{2})$) for $n \ge 1$.

(e) For $n \ge 1$, $\mathrm{Var}(\tilde{Y}_n) = E[(\tilde{Y}_n)^2] = E[U_1^2 \cdots U_{n-1}^2(U_n - \frac{1}{2})^2] = 3^{-(n-1)}/12$ and
$\mathrm{Cov}(X_M, \tilde{Y}_n) = E[(U_1 + \cdots + U_M)\tilde{Y}_n] = E[(U_1 + \cdots + U_M)U_1 \cdots U_{n-1}(U_n - \frac{1}{2})]
= E[U_n(U_1 \cdots U_{n-1})(U_n - \frac{1}{2})] = 2^{-(n-1)}\mathrm{Var}(U_n) = 2^{-(n-1)}/12$. Since $\tilde{Y}_0 = 1$ and all the other
innovations variables are mean zero, we have

$$\hat{E}[X_M|Y_0, \ldots, Y_M] = \frac{M}{2} + \sum_{n=1}^M \frac{\mathrm{Cov}(X_M, \tilde{Y}_n)\tilde{Y}_n}{\mathrm{Var}(\tilde{Y}_n)}$$
$$= \frac{M}{2} + \sum_{n=1}^M \frac{2^{-n+1}/12}{3^{-n+1}/12}\tilde{Y}_n$$
$$= \frac{M}{2} + \sum_{n=1}^M \left(\frac{3}{2}\right)^{n-1}\tilde{Y}_n$$



3.28 Linear innovations and orthogonal polynomials for the uniform distribution 

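(a)
$$E[U^n] = \frac{1}{2}\int_{-1}^1 u^n\,du = \frac{u^{n+1}}{2(n+1)}\Big|_{-1}^1 = \begin{cases} \frac{1}{n+1} & n \text{ even} \\ 0 & n \text{ odd} \end{cases}$$

(b) The formula for the linear innovations sequence yields:
$\tilde{Y}_1 = U$, $\tilde{Y}_2 = U^2 - \frac{1}{3}$, $\tilde{Y}_3 = U^3 - \frac{3}{5}U$, and

$$\tilde{Y}_4 = U^4 - E[U^4 \cdot 1] \cdot 1 - \frac{E[U^4(U^2 - \frac{1}{3})]}{E[(U^2 - \frac{1}{3})^2]}\left(U^2 - \frac{1}{3}\right) = U^4 - \frac{1}{5} - \left(\frac{\frac{1}{7} - \frac{1}{15}}{\frac{1}{5} - \frac{2}{9} + \frac{1}{9}}\right)\left(U^2 - \frac{1}{3}\right) = U^4 - \frac{6}{7}U^2 + \frac{3}{35}.$$

Note: These mutually orthogonal (with respect to the uniform distribution on $[-1, 1]$) polynomials $1$, $U$,
$U^2 - \frac{1}{3}$, $U^3 - \frac{3}{5}U$, $U^4 - \frac{6}{7}U^2 + \frac{3}{35}$ are (up to constant multiples) known as the Legendre polynomials.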
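Orthogonality of the five polynomials can be verified exactly from the moment formula in part (a). A minimal sketch using rational arithmetic, with polynomials stored as coefficient lists (lowest degree first); the names are illustrative.

```python
from fractions import Fraction as F

def moment(n: int) -> F:
    """E[U^n] for U uniform on [-1, 1]."""
    return F(1, n + 1) if n % 2 == 0 else F(0)

def inner(p, q) -> F:
    """E[p(U) q(U)] for polynomials given as coefficient lists."""
    return sum(a * b * moment(i + j) for i, a in enumerate(p) for j, b in enumerate(q))

# 1, U, U^2 - 1/3, U^3 - (3/5) U, U^4 - (6/7) U^2 + 3/35
POLYS = [
    [F(1)],
    [F(0), F(1)],
    [F(-1, 3), F(0), F(1)],
    [F(0), F(-3, 5), F(0), F(1)],
    [F(3, 35), F(0), F(-6, 7), F(0), F(1)],
]

if __name__ == "__main__":
    for i in range(len(POLYS)):
        print([inner(POLYS[i], POLYS[j]) for j in range(len(POLYS))])
```

The printed Gram matrix is diagonal, with the diagonal entries giving the mean squares of the innovations.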

3.30 Kalman filter for a rotating state 

(a) Given a nonzero vector $x_k$, the vector $Fx_k$ is obtained by rotating the vector $x_k$ one tenth
revolution counter-clockwise about the origin, and then shrinking the vector towards zero by one
percent. Thus, successive iterates $F^k x_0$ spiral in towards zero, with one-tenth revolution per time
unit, and shrinking by about ten percent per revolution.

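(b) Suppose $\Sigma_{k|k-1} = \begin{pmatrix} a & c \\ c & b \end{pmatrix}$. Note that $H_k^T \Sigma_{k|k-1} H_k$ is simply the scalar $a$. The equations for
$\Sigma_{k+1|k}$ and $K_k$ can thus be written as

$$\Sigma_{k+1|k} = F\left[\Sigma_{k|k-1} - \begin{pmatrix} \frac{a^2}{1+a} & \frac{ac}{1+a} \\ \frac{ac}{1+a} & \frac{c^2}{1+a} \end{pmatrix}\right]F^T + Q$$

$$K_k = F\begin{pmatrix} \frac{a}{1+a} \\ \frac{c}{1+a} \end{pmatrix}$$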
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4.2 Correlation function of a product 

$R_X(s, t) = E[Y_s Z_s Y_t Z_t] = E[Y_s Y_t Z_s Z_t] = E[Y_s Y_t]E[Z_s Z_t] = R_Y(s, t)R_Z(s, t)$

4.4 Another sinusoidal random process 

(a) Since $EX_1 = EX_2 = 0$, $EY_t \equiv 0$. The autocorrelation function is given by

$$R_Y(s, t) = E[X_1^2]\cos(2\pi s)\cos(2\pi t) - 2E[X_1 X_2]\cos(2\pi s)\sin(2\pi t) + E[X_2^2]\sin(2\pi s)\sin(2\pi t)$$
$$= \sigma^2(\cos(2\pi s)\cos(2\pi t) + \sin(2\pi s)\sin(2\pi t))$$
$$= \sigma^2\cos(2\pi(s - t)) \quad \text{(a function of } s - t \text{ only)}$$

So $(Y_t : t \in \mathbb{R})$ is WSS.

(b) If $X_1$ and $X_2$ are each Gaussian random variables and are independent then $(Y_t : t \in \mathbb{R})$ is a
real-valued Gaussian WSS random process and is hence stationary.

(c) A simple solution to this problem is to take $X_1$ and $X_2$ to be independent, mean zero, variance
$\sigma^2$ random variables with different distributions. For example, $X_1$ could be $N(0, \sigma^2)$ and $X_2$ could
be discrete with $P\{X_2 = \sigma\} = P\{X_2 = -\sigma\} = \frac{1}{2}$. Then $Y_0 = X_1$ and $Y_{3/4} = X_2$, so $Y_0$ and $Y_{3/4}$ do
not have the same distribution, so that $Y$ is not stationary.

4.6 Brownian motion: Ascension and smoothing 

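(a) Since the increments of $W$ over nonoverlapping intervals are independent, mean zero Gaussian
random variables,

$$P\{W_r \le W_s \le W_t\} = P\{W_s - W_r \ge 0, W_t - W_s \ge 0\} = P\{W_s - W_r \ge 0\}P\{W_t - W_s \ge 0\} = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}.$$

(b) Since $W$ is a Gaussian process, the three random variables $W_r, W_s, W_t$ are jointly Gaussian.
They also all have mean zero, so that

$$E[W_s|W_r, W_t] = \hat{E}[W_s|W_r, W_t]$$
$$= (\mathrm{Cov}(W_s, W_r), \mathrm{Cov}(W_s, W_t))\begin{pmatrix} \mathrm{Var}(W_r) & \mathrm{Cov}(W_r, W_t) \\ \mathrm{Cov}(W_t, W_r) & \mathrm{Var}(W_t) \end{pmatrix}^{-1}\begin{pmatrix} W_r \\ W_t \end{pmatrix}$$
$$= (r, s)\begin{pmatrix} r & r \\ r & t \end{pmatrix}^{-1}\begin{pmatrix} W_r \\ W_t \end{pmatrix}$$
$$= \frac{(t - s)W_r + (s - r)W_t}{t - r}$$

where we use the fact $\begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$. Note that as $s$ varies from $r$ to $t$,
$E[W_s|W_r, W_t]$ is obtained by linearly interpolating between $W_r$ and $W_t$.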
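The interpolation weights can be recomputed directly from the covariance structure $\mathrm{Cov}(W_u, W_v) = \min(u, v)$. A minimal sketch with exact fractions, assuming $0 < r < s < t$; the function name is illustrative.

```python
from fractions import Fraction

def interpolation_weights(r, s, t):
    """Coefficients (w_r, w_t) with E[W_s | W_r, W_t] = w_r * W_r + w_t * W_t."""
    r, s, t = Fraction(r), Fraction(s), Fraction(t)
    det = r * t - r * r                          # determinant of [[r, r], [r, t]]
    inv = [[t / det, -r / det], [-r / det, r / det]]
    cross = (r, s)                               # (Cov(W_s, W_r), Cov(W_s, W_t))
    w_r = cross[0] * inv[0][0] + cross[1] * inv[1][0]
    w_t = cross[0] * inv[0][1] + cross[1] * inv[1][1]
    return w_r, w_t

if __name__ == "__main__":
    r, s, t = 1, 2, 5
    print(interpolation_weights(r, s, t))        # weights (t-s)/(t-r) and (s-r)/(t-r)
```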
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4.8 Some Poisson process calculations 

(a) (Method 1) Given $N_2 = 2$, the distribution of the locations of the first two jump points are as
if they are independently and uniformly distributed over the interval $[0, 2]$. Thus, each falls in the
second half of the interval with probability $1/2$, and they both fall into the second half with
probability $(\frac{1}{2})^2 = \frac{1}{4}$. Thus, at least one of them falls in the first half of the interval with probability
$1 - \frac{1}{4} = \frac{3}{4}$.

(Method 2) Since $N_1$ and $N_2 - N_1$ are independent, $Poi(\lambda)$ random variables,

$$P\{N_1 \ge 1, N_2 = 2\} = P\{N_1 = 1, N_2 = 2\} + P\{N_1 = 2, N_2 = 2\}$$
$$= P\{N_1 = 1, N_2 - N_1 = 1\} + P\{N_1 = 2, N_2 - N_1 = 0\}$$
$$= (\lambda e^{-\lambda})(\lambda e^{-\lambda}) + \left(\frac{\lambda^2 e^{-\lambda}}{2}\right)e^{-\lambda} = \frac{3\lambda^2 e^{-2\lambda}}{2}$$

and $P\{N_2 = 2\} = \frac{(2\lambda)^2 e^{-2\lambda}}{2}$. Thus, $P(N_1 \ge 1|N_2 = 2) = \frac{3\lambda^2 e^{-2\lambda}}{2} \Big/ \left(\frac{(2\lambda)^2 e^{-2\lambda}}{2}\right) = \frac{3}{4}$.

(b) $P\{N_1 \ge 1\} = 1 - e^{-\lambda}$. Therefore, $P(N_2 = 2|N_1 \ge 1) = \frac{3\lambda^2 e^{-2\lambda}}{2} \Big/ (1 - e^{-\lambda})$.

(c) Yes. The process $N$ is Markov (because $N_0$ is constant and $N$ has independent increments),
and since $X_t$ and $N_t$ are functions of each other for each $t$, the process $X$ is also Markov. The state
space for $X$ is $S = \{0, 1, 4, 9, \ldots\}$. For $i, j \in \mathbb{Z}_+$ and $t, \tau \ge 0$,

$$p_{i^2, j^2}(\tau) = P(X_{t+\tau} = j^2|X_t = i^2) = \begin{cases} \frac{(\lambda\tau)^{j-i}e^{-\lambda\tau}}{(j-i)!} & j \ge i \\ 0 & \text{else} \end{cases}$$



4.10 MMSE prediction for a Gaussian process based on two observations 

(a) Since $R_X(0) = 5$, $R_X(1) = 0$, and $R_X(2) = -\frac{5}{9}$, the covariance matrix is $\begin{pmatrix} 5 & 0 & -\frac{5}{9} \\ 0 & 5 & 0 \\ -\frac{5}{9} & 0 & 5 \end{pmatrix}$.

(b) As the variables are all mean zero, $E[X(4)|X(2)] = \frac{\mathrm{Cov}(X(4), X(2))}{\mathrm{Var}(X(2))}X(2) = -\frac{X(2)}{9}$.

(c) The variable $X(3)$ is uncorrelated with $(X(2), X(4))^T$. Since the variables are jointly Gaussian,
$X(3)$ is also independent of $(X(2), X(4))^T$. So $E[X(4)|X(2)] = E[X(4)|X(2), X(3)] = -\frac{X(2)}{9}$.

4.12 Poisson process probabilities 

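(a) The numbers of arrivals in the disjoint intervals are independent Poisson random variables with
mean $\lambda$. Thus, the probability is $(\lambda e^{-\lambda})^3 = \lambda^3 e^{-3\lambda}$.

(b) The event is the same as the event that the numbers of counts in the intervals $[0,1]$, $[1,2]$, and
$[2,3]$ are 020, 111, or 202. The probability is thus
$$e^{-\lambda}\left(\frac{\lambda^2}{2}e^{-\lambda}\right)e^{-\lambda} + (\lambda e^{-\lambda})^3 + \left(\frac{\lambda^2}{2}e^{-\lambda}\right)e^{-\lambda}\left(\frac{\lambda^2}{2}e^{-\lambda}\right) = \left(\frac{\lambda^2}{2} + \lambda^3 + \frac{\lambda^4}{4}\right)e^{-3\lambda}.$$

(c) This is the same as the probability the counts are 020, divided by the answer to part (b), or
$\frac{\lambda^2}{2} \Big/ \left(\frac{\lambda^2}{2} + \lambda^3 + \frac{\lambda^4}{4}\right) = \frac{2}{2 + 4\lambda + \lambda^2}$.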
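The three answers above can be checked numerically. The following is a minimal sketch, not part of the original solution; the rate value `lam = 1.7` and the helper names are assumptions chosen only for illustration.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P{N = k} for a Poisson random variable with mean lam."""
    return exp(-lam) * lam**k / factorial(k)

def prob_counts(counts, lam):
    """Probability that independent Poi(lam) counts in unit intervals equal `counts`."""
    p = 1.0
    for k in counts:
        p *= poisson_pmf(k, lam)
    return p

lam = 1.7  # hypothetical rate, used only for the check

# Part (a): one arrival in each of the three intervals.
part_a = prob_counts((1, 1, 1), lam)
# Part (b): count patterns 020, 111, or 202.
part_b = sum(prob_counts(c, lam) for c in [(0, 2, 0), (1, 1, 1), (2, 0, 2)])
# Part (c): pattern 020 conditioned on the event of part (b).
part_c = prob_counts((0, 2, 0), lam) / part_b
```

Comparing `part_c` with `2 / (2 + 4*lam + lam**2)` for a few values of `lam` is a quick way to confirm the simplification in part (c).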

4.14 Adding jointly stationary Gaussian processes 

(a) $R_Z(s,t) = E\left[\left(\frac{X(s)+Y(s)}{2}\right)\left(\frac{X(t)+Y(t)}{2}\right)\right] = \frac{1}{4}\left[R_X(s-t) + R_Y(s-t) + R_{XY}(s-t) + R_{YX}(s-t)\right]$.
So $R_Z(s,t)$ is a function of $s - t$. Also, $R_{YX}(s,t) = R_{XY}(t,s)$. Thus,

$$R_Z(\tau) = \frac{1}{4}\left[2e^{-|\tau|} + \frac{e^{-|\tau-3|}}{2} + \frac{e^{-|\tau+3|}}{2}\right].$$

(b) Yes, the mean function of $Z$ is constant ($\mu_Z = 0$) and $R_Z(s,t)$ is a function of $s - t$ only, so $Z$
is WSS. Moreover, $Z$ is obtained from the jointly Gaussian processes $X$ and $Y$ by linear operations,
so $Z$ is a Gaussian process. Since $Z$ is Gaussian and WSS, it is stationary.

(c) $P\{X(1) \le 5Y(2) + 1\} = P\left\{\frac{X(1) - 5Y(2)}{\sigma} \le \frac{1}{\sigma}\right\} = \Phi\left(\frac{1}{\sigma}\right)$, where

$$\sigma^2 = \mathrm{Var}(X(1) - 5Y(2)) = R_X(0) - 10R_{XY}(1-2) + 25R_Y(0) = 1 - \frac{10e^{-4}}{2} + 25 = 26 - 5e^{-4}.$$

4.16 A linear evolution equation with random coefficients 

(a) $P_{k+1} = E[(A_kX_k + B_k)^2] = E[A_k^2X_k^2] + 2E[A_kX_k]E[B_k] + E[B_k^2] = \sigma_A^2P_k + \sigma_B^2$.

(b) Yes. Think of $n$ as the present time. The future values $X_{n+1}, X_{n+2}, \ldots$ are all functions of $X_n$
and $(A_k, B_k : k \ge n)$. But the variables $(A_k, B_k : k \ge n)$ are independent of $X_0, X_1, \ldots, X_n$. Thus,
the future is conditionally independent of the past, given the present.

(c) No. For example, $X_1 - X_0 = X_1 = B_1$, and $X_2 - X_1 = A_2B_1 + B_2$, and clearly $B_1$ and $A_2B_1 + B_2$
are not independent. (Given $B_1 = b$, the conditional distribution of $A_2B_1 + B_2$ is $N(0, \sigma_A^2b^2 + \sigma_B^2)$,
which depends on $b$.)

(d) Suppose $s$ and $t$ are integer times with $s < t$. Then $R_Y(s,t) = E[Y_s(A_{t-1}Y_{t-1} + B_{t-1})] = E[A_{t-1}]E[Y_sY_{t-1}] + E[Y_s]E[B_{t-1}] = 0$. Thus,
$$R_Y(s,t) = \begin{cases} P_k & \text{if } s = t = k \\ 0 & \text{else.} \end{cases}$$

(e) The variables $Y_1, Y_2, \ldots$ are already orthogonal by part (d) (and the fact the variables have
mean zero). Thus, $\tilde{Y}_k = Y_k$ for all $k \ge 1$.



4.18 A fly on a cube 

(a)-(b) See the figures. For part (a), each two-headed line represents a directed edge in each 
direction and all directed edges have probability 1/3. 




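[Figures for parts (a) and (b) not reproduced: state transition diagrams. The diagram for part (b) is labeled with transition probabilities 1/3 and 2/3.]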



(c) Let $a_i$ be the mean time for $Y$ to first reach state zero starting in state $i$. Conditioning on the
first time step yields $a_1 = 1 + \frac{2}{3}a_2$, $a_2 = 1 + \frac{2}{3}a_1 + \frac{1}{3}a_3$, $a_3 = 1 + a_2$. Using the first and third
of these equations to eliminate $a_1$ and $a_3$ from the second equation yields $a_2$, and then $a_1$ and $a_3$.
The solution is $(a_1, a_2, a_3) = (7, 9, 10)$. Therefore, $E[\tau] = 1 + a_1 = 8$.
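The first-step equations in part (c) form a small linear system, so the stated solution can be confirmed exactly. The sketch below is illustrative only; the helper `solve` is a generic Gauss-Jordan routine written for this check, not something taken from the notes.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination over the rationals."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it into place.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [v - factor * w for v, w in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

# Rewrite a1 = 1 + (2/3)a2, a2 = 1 + (2/3)a1 + (1/3)a3, a3 = 1 + a2
# in the form A a = (1, 1, 1).
A = [[F(1), F(-2, 3), F(0)],
     [F(-2, 3), F(1), F(-1, 3)],
     [F(0), F(-1), F(1)]]
a1, a2, a3 = solve(A, [F(1), F(1), F(1)])
mean_return_time = 1 + a1  # E[tau] as defined in part (c)
```

Using `Fraction` keeps the arithmetic exact, so the result can be compared directly with the integers quoted in the solution.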



4.20 A random process created by interpolation 



(a) [Figure not reproduced.]

(b) $X_t$ is the sum of two random variables, $(1-a)U_n$, which is uniformly distributed on the interval
$[0, 1-a]$, and $aU_{n+1}$, which is uniformly distributed on the interval $[0, a]$. Thus, the density of $X_t$
is the convolution of the densities of these two variables. [Figure not reproduced.]

(c) $C_X(t,t) = \frac{a^2 + (1-a)^2}{12}$ for $t = n + a$. Since this depends on $t$, $X$ is not WSS.

(d) $P\{\max_{0 \le t \le 10} X_t \le 0.5\} = P\{U_k \le 0.5 \text{ for } 0 \le k \le 10\} = (0.5)^{11}$.

4.22 Restoring samples 

(a) Yes. The possible values of $X_k$ are $\{1, \ldots, k-1\}$. Given $X_k$, $X_{k+1}$ is equal to $X_k$ with probability
$\frac{X_k}{k}$ and is equal to $X_k + 1$ with probability $1 - \frac{X_k}{k}$. Another way to say this is that the one-step
transition probabilities for the transition from $X_k$ to $X_{k+1}$ are given by

$$p_{ij} = \begin{cases} \frac{i}{k} & \text{for } j = i \\ 1 - \frac{i}{k} & \text{for } j = i + 1 \\ 0 & \text{else.} \end{cases}$$

(b) $E[X_{k+1} \mid X_k] = X_k\left(\frac{X_k}{k}\right) + (X_k + 1)\left(1 - \frac{X_k}{k}\right) = X_k + 1 - \frac{X_k}{k}$.

(c) The Markov property of $X$, the information equivalence of $X_k$ and $M_k$, and part (b), imply that
$E[M_{k+1} \mid M_2, \ldots, M_k] = E[M_{k+1} \mid M_k] = \frac{1}{k+1}\left(X_k + 1 - \frac{X_k}{k}\right) \ne M_k$, so that $(M_k)$ does not form a
martingale sequence.

(d) Using the transition probabilities mentioned in part (a) again yields (with some tedious algebra
steps not shown)

$$\begin{aligned}
E[D_{k+1}^2 \mid X_k] &= \left(\frac{X_k}{k+1} - \frac{1}{2}\right)^2\left(\frac{X_k}{k}\right) + \left(\frac{X_k + 1}{k+1} - \frac{1}{2}\right)^2\left(\frac{k - X_k}{k}\right) \\
&= \frac{1}{4k(k+1)^2}\left\{(4k - 8)X_k^2 - (4k - 8)kX_k + k(k-1)^2\right\} \\
&= \frac{1}{(k+1)^2}\left\{k(k-2)D_k^2 + \frac{1}{4}\right\}.
\end{aligned}$$

(e) Since, by the tower property of conditional expectations, $v_{k+1} = E[D_{k+1}^2] = E[E[D_{k+1}^2 \mid X_k]]$,
taking the expectation on each side of the equation found in part (d) yields

$$v_{k+1} = \frac{1}{(k+1)^2}\left\{k(k-2)v_k + \frac{1}{4}\right\},$$

and the initial condition $v_2 = 0$ holds. The desired inequality, $v_k \le \frac{1}{4k}$, is thus true for $k = 2$. For
the purpose of proof by induction, suppose that $v_k \le \frac{1}{4k}$ for some $k \ge 2$. Then,

$$v_{k+1} \le \frac{1}{(k+1)^2}\left\{\frac{k(k-2)}{4k} + \frac{1}{4}\right\} = \frac{1}{4(k+1)^2}\{k - 2 + 1\} \le \frac{1}{4(k+1)}.$$

So the desired inequality is true for $k+1$. Therefore, by induction, $v_k \le \frac{1}{4k}$ for all $k$. Hence,
$v_k \to 0$ as $k \to \infty$. By definition, this means that $M_k \to \frac{1}{2}$ in the mean square sense as $k \to \infty$. (We could also note that,
since $M_k$ is bounded, the convergence also holds in probability, and also it holds in distribution.)
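The recursion for $v_k$ and the bound $v_k \le 1/(4k)$ can be checked with exact arithmetic. The sketch below is an illustration rather than part of the original solution. It assumes $M_k = X_k/k$, $D_k = M_k - 1/2$ and $X_2 = 1$, which is how the solution above reads; the function names are hypothetical.

```python
from fractions import Fraction as F

def variance_sequence(k_max):
    """v_k for 2 <= k <= k_max from v_{k+1} = (k(k-2) v_k + 1/4) / (k+1)^2, v_2 = 0."""
    v = {2: F(0)}
    for k in range(2, k_max):
        v[k + 1] = (k * (k - 2) * v[k] + F(1, 4)) / (k + 1) ** 2
    return v

def exact_second_moment(k_max):
    """E[(X_k/k - 1/2)^2] computed by propagating the distribution of X_k.

    Assumes X_2 = 1 and the transition probabilities of part (a):
    stay at x with probability x/k, move to x+1 with probability 1 - x/k.
    """
    dist = {1: F(1)}
    out = {2: F(0)}
    for k in range(2, k_max):
        new = {}
        for x, p in dist.items():
            new[x] = new.get(x, 0) + p * F(x, k)
            new[x + 1] = new.get(x + 1, 0) + p * (1 - F(x, k))
        dist = new
        out[k + 1] = sum(p * (F(x, k + 1) - F(1, 2)) ** 2 for x, p in dist.items())
    return out

v = variance_sequence(200)
bound_holds = all(v[k] <= F(1, 4 * k) for k in v)
```

Agreement between `exact_second_moment` and `variance_sequence` is a check on the algebra of part (d), while `bound_holds` checks the induction of part (e) for the first 200 values of $k$.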

4.24 An M/M/1/B queueing system

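(a) $$Q = \begin{pmatrix} -\lambda & \lambda & & & \\ 1 & -(1+\lambda) & \lambda & & \\ & 1 & -(1+\lambda) & \lambda & \\ & & 1 & -(1+\lambda) & \lambda \\ & & & 1 & -1 \end{pmatrix}$$

(b) The equilibrium vector $\pi = (\pi_0, \pi_1, \ldots, \pi_B)$ solves $\pi Q = 0$. Thus, $\lambda\pi_0 = \pi_1$. Also, $\lambda\pi_0 - (1+\lambda)\pi_1 + \pi_2 = 0$,
which with the first equation yields $\lambda\pi_1 = \pi_2$. Continuing this way yields
that $\pi_n = \lambda\pi_{n-1}$ for $1 \le n \le B$. Thus, $\pi_n = \lambda^n\pi_0$. Since the probabilities must sum to one,
$\pi_n = \lambda^n/(1 + \lambda + \cdots + \lambda^B)$.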
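The balance argument in part (b) can be verified numerically by building $Q$ for given $\lambda$ and $B$ and checking that $\pi Q = 0$. This is a minimal sketch; the values `lam = 0.8` and `B = 4` are assumptions for illustration, and the service rate is taken to be one as in the matrix of part (a).

```python
def mm1b_generator(lam, B):
    """Generator matrix for arrival rate lam, service rate 1, and states 0..B."""
    Q = [[0.0] * (B + 1) for _ in range(B + 1)]
    for n in range(B + 1):
        if n < B:
            Q[n][n + 1] = lam   # arrival accepted
        if n > 0:
            Q[n][n - 1] = 1.0   # service completion
        Q[n][n] = -sum(Q[n])    # diagonal entry makes the row sum to zero
    return Q

def mm1b_equilibrium(lam, B):
    """pi_n = lam^n / (1 + lam + ... + lam^B)."""
    weights = [lam ** n for n in range(B + 1)]
    total = sum(weights)
    return [w / total for w in weights]

lam, B = 0.8, 4  # hypothetical values
Q = mm1b_generator(lam, B)
pi = mm1b_equilibrium(lam, B)
# Entries of the row vector pi Q; all should vanish up to rounding.
residual = [sum(pi[i] * Q[i][j] for i in range(B + 1)) for j in range(B + 1)]
```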

4.26 Identification of special properties of two discrete-time processes (version 2) 

(a) (yes, yes, no). The process is Markov by its description. Think of a time $k$ as the present
time. Given the number of cells alive at the present time $k$ (i.e. given $X_k$) the future evolution
does not depend on the past. To check for the martingale property in discrete time, it suffices to
check that $E[X_{k+1} \mid X_1, \ldots, X_k] = X_k$. But this equality is true because for each cell alive at time
$k$, the expected number of cells alive at time $k + 1$ is one ($= 0.5 \times 0 + 0.5 \times 2$). The process does
not have independent increments, because, for example, $P(X_2 - X_1 = 0 \mid X_1 - X_0 = -1) = 1$ and
$P(X_2 - X_1 = 0 \mid X_1 - X_0 = 1) = 1/2$. So $X_2 - X_1$ is not independent of $X_1 - X_0$.

(b) (yes, yes, no). Let $k$ be the present time. Given $Y_k$, the future values are all determined by
$Y_k, U_{k+1}, U_{k+2}, \ldots$. Since $U_{k+1}, U_{k+2}, \ldots$ is independent of $Y_0, \ldots, Y_k$, the future of $Y$ is
conditionally independent of the past, given the present value $Y_k$. So $Y$ is Markov. The process $Y$ is a
martingale because $E[Y_{k+1} \mid Y_1, \ldots, Y_k] = E[U_{k+1}Y_k \mid Y_1, \ldots, Y_k] = Y_kE[U_{k+1} \mid Y_1, \ldots, Y_k] = Y_kE[U_{k+1}] = Y_k$.
The process $Y$ does not have independent increments, because, for example, $Y_1 - Y_0 = U_1 - 1$
is clearly not independent of $Y_2 - Y_1 = U_1(U_2 - 1)$. (To argue this further we could note that the
conditional density of $Y_2 - Y_1$ given $Y_1 - Y_0 = y - 1$ is the uniform distribution over the interval
$[-y, y]$, which depends on $y$.)

4.28 Identification of special properties of two continuous-time processes (version 2) 

(a) (yes, no, no) $Z$ is Markov because $W$ is Markov and the mapping from $W_t$ to $Z_t$ is invertible. So
$W_t$ and $Z_t$ have the same information. To see if $W^3$ is a martingale we suppose $s \le t$ and use the
independent increment property of $W$ to get:

$$E[W_t^3 \mid W_u, 0 \le u \le s] = E[W_t^3 \mid W_s] = E[(W_t - W_s + W_s)^3 \mid W_s] = 3E[(W_t - W_s)^2]W_s + W_s^3 = 3(t-s)W_s + W_s^3 \ne W_s^3.$$

Therefore, $W^3$ is not a martingale. If the increments were independent, then since $W_s^3$ is the
increment $W_s^3 - W_0^3$, it would have to be that $E[(W_t - W_s + W_s)^3 - W_s^3 \mid W_s]$ doesn't depend on $W_s$. But it
does. So the increments are not independent.

(b) (no, no, no) (i) $R$ is not Markov because knowing $R_t$ for a fixed $t$ only determines $\Theta$ to
be one of two values. For one of these values $R$ has a positive derivative at $t$, and for the other
$R$ has a negative derivative at $t$. If the past of $R$ just before $t$ were also known, then $\Theta$ could be
completely determined, which would give more information about the future of $R$. So $R$ is not
Markov. (ii) $R$ is not a martingale. For example, observing $R$ on a finite interval totally determines
$R$. So $E[R_t \mid (R_u, 0 \le u \le s)] = R_t$, and if $s - t$ is not an integer, $R_s \ne R_t$. (iii) $R$ does not have
independent increments. For example the increments $R(0.5) - R(0)$ and $R(1.5) - R(1)$ are identical
random variables, not independent random variables.

4.30 Moving balls 

(a) The states of the "relative-position process" can be taken to be 111, 12, and 21. The state 111
means that the balls occupy three consecutive positions, the state 12 means that one ball is in the
leftmost occupied position and the other two balls are one position to the right of it, and the state
21 means there are two balls in the leftmost occupied position and one ball one position to the
right of them. With the states in the order 111, 12, 21, the one-step transition probability matrix
is given by

$$P = \begin{pmatrix} 0.5 & 0.5 & 0 \\ 0 & 0 & 1 \\ 0.5 & 0.5 & 0 \end{pmatrix}.$$

(b) The equilibrium distribution $\pi$ of the process is the probability vector satisfying $\pi = \pi P$, from
which we find $\pi = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$. That is, all three states are equally likely in equilibrium.

(c) Over a long period of time, we expect the process to be in each of the states about a third of the time.
After each visit to states 111 or 12, the leftmost position of the configuration advances one
position to the right. After a visit to state 21, the next state will be 111 or 12, and the leftmost position of
the configuration does not advance. Thus, after 2/3 of the slots there will be an advance. So the
long-term speed of the balls is 2/3. Another approach is to compute the mean distance the moved
ball travels in each slot, and divide by three.
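Parts (b) and (c) can be confirmed with a few lines of exact arithmetic. The sketch below assumes the transition matrix for the state order 111, 12, 21 described in part (a); the dictionary `advance` encodes the observation that the leftmost position moves right after a visit to 111 or 12 but not after a visit to 21.

```python
from fractions import Fraction as F

STATES = ("111", "12", "21")
P = [[F(1, 2), F(1, 2), F(0)],
     [F(0), F(0), F(1)],
     [F(1, 2), F(1, 2), F(0)]]

def step(dist, P):
    """One step of the chain acting on a row vector: returns dist P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

pi = [F(1, 3)] * 3
is_equilibrium = step(pi, P) == pi

# Advance of the leftmost occupied position during the slot following each state.
advance = {"111": 1, "12": 1, "21": 0}
speed = sum(p * advance[s] for p, s in zip(pi, STATES))
```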

(d) The same states can be used to track the relative positions of the balls as in discrete time. The
generator matrix is given by

$$Q = \begin{pmatrix} -0.5 & 0.5 & 0 \\ 0 & -1 & 1 \\ 0.5 & 0.5 & -1 \end{pmatrix}.$$

(Note that if the state is 111 and if the
leftmost ball is moved to the rightmost position, the state of the relative-position process is 111
the entire time. That is, the relative-position process misses such jumps in the actual configuration
process.) The equilibrium distribution can be determined by solving the equation $\pi Q = 0$, and
the solution is found to be $\pi = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$ as before. When the relative-position process is in states
111 or 12, the leftmost position of the actual configuration advances one position to the right at
rate one, while when the relative-position process is in state 21, the leftmost position of the
actual configuration cannot directly move right. The long-term average speed is thus 2/3, as in the
discrete-time case.

4.32 Mean hitting time for a continuous-time, discrete-space Markov process

$$Q = \begin{pmatrix} -1 & 1 & 0 \\ 10 & -11 & 1 \\ 0 & 5 & -5 \end{pmatrix} \qquad \pi = \left(\frac{50}{56}, \frac{5}{56}, \frac{1}{56}\right)$$

Consider $X_h$ to get

$$\begin{aligned}
a_1 &= h + (1 - h)a_1 + ha_2 + o(h) \\
a_2 &= h + 10ha_1 + (1 - 11h)a_2 + o(h)
\end{aligned}$$

or equivalently $1 - a_1 + a_2 + \frac{o(h)}{h} = 0$ and $1 + 10a_1 - 11a_2 + \frac{o(h)}{h} = 0$. Let $h \to 0$ to get $1 - a_1 + a_2 = 0$
and $1 + 10a_1 - 11a_2 = 0$, or $a_1 = 12$ and $a_2 = 11$.
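Both the equilibrium vector and the mean hitting times can be checked by substitution. The following sketch uses exact fractions; the function name `hitting_times_to_state_3` is hypothetical, and it simply solves the two limiting equations by Cramer's rule.

```python
from fractions import Fraction as F

Q = [[F(-1), F(1), F(0)],
     [F(10), F(-11), F(1)],
     [F(0), F(5), F(-5)]]
pi = [F(50, 56), F(5, 56), F(1, 56)]
# Entries of pi Q; each should be exactly zero.
balance = [sum(pi[i] * Q[i][j] for i in range(3)) for j in range(3)]

def hitting_times_to_state_3(Q):
    """Solve 1 + q11 a1 + q12 a2 = 0 and 1 + q21 a1 + q22 a2 = 0 by Cramer's rule."""
    (q11, q12), (q21, q22) = Q[0][:2], Q[1][:2]
    det = q11 * q22 - q12 * q21
    a1 = (-q22 + q12) / det
    a2 = (-q11 + q21) / det
    return a1, a2
```

The two equations are the $h \to 0$ limits derived above, written with the entries of $Q$ restricted to the first two states.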

4.34 Poisson splitting 

This is basically the previous problem in reverse. This solution is based directly on the definition
of a Poisson process, but there are other valid approaches. Let $X$ be a Poisson random variable with mean $\lambda$, and
let each of $X$ individuals be independently assigned a type, with type $i$ having probability $p_i$, for
some probability distribution $p_1, \ldots, p_K$. Let $X_i$ denote the number assigned type $i$. Then,

$$\begin{aligned}
P(X_1 = i_1, X_2 = i_2, \ldots, X_K = i_K) &= P(X = i_1 + \cdots + i_K)\,\frac{(i_1 + \cdots + i_K)!}{i_1!\,i_2!\cdots i_K!}\,p_1^{i_1}\cdots p_K^{i_K} \\
&= \prod_{j=1}^{K} \frac{e^{-\lambda_j}\lambda_j^{i_j}}{i_j!},
\end{aligned}$$

where $\lambda_i = \lambda p_i$. Thus, independent splitting of a Poisson number of individuals yields that the
number of each type $i$ is Poisson, with mean $\lambda_i = \lambda p_i$, and they are independent of each other.
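The splitting identity can be spot-checked numerically by evaluating both sides for particular counts. This is an illustrative sketch only; the values of `lam`, `p`, and the counts in the usage line are assumptions.

```python
from math import exp, factorial, prod

def poisson_pmf(k, mean):
    """P{N = k} for a Poisson random variable with the given mean."""
    return exp(-mean) * mean**k / factorial(k)

def split_joint_pmf(counts, lam, p):
    """Poisson(lam) total, then independent type assignment with distribution p."""
    n = sum(counts)
    multinomial = factorial(n)
    for i in counts:
        multinomial //= factorial(i)  # stays an integer at every step
    return poisson_pmf(n, lam) * multinomial * prod(q**i for q, i in zip(p, counts))

def product_of_poissons(counts, lam, p):
    """Product of independent Poisson(lam * p_i) probabilities."""
    return prod(poisson_pmf(i, lam * q) for q, i in zip(p, counts))

# Hypothetical example: three types, counts (2, 0, 3).
lhs = split_joint_pmf((2, 0, 3), 2.5, (0.2, 0.5, 0.3))
rhs = product_of_poissons((2, 0, 3), 2.5, (0.2, 0.5, 0.3))
```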

Now suppose that $N$ is a rate $\lambda$ Poisson process, and that $N_i$ is the process of type $i$ points,
given independent splitting of $N$ with split distribution $p_1, \ldots, p_K$. By the definition of a Poisson
process, the following random variables are independent, with the $i$th having the $\mathrm{Poi}(\lambda(t_i - t_{i-1}))$
distribution:

$$N(t_1) - N(t_0) \quad N(t_2) - N(t_1) \quad \cdots \quad N(t_p) - N(t_{p-1}) \qquad (12.5)$$

Suppose each column of the following array is obtained by independent splitting of the corresponding
variable in (12.5).

$$\begin{array}{cccc}
N_1(t_1) - N_1(t_0) & N_1(t_2) - N_1(t_1) & \cdots & N_1(t_p) - N_1(t_{p-1}) \\
N_2(t_1) - N_2(t_0) & N_2(t_2) - N_2(t_1) & \cdots & N_2(t_p) - N_2(t_{p-1}) \\
\vdots & \vdots & & \vdots \\
N_K(t_1) - N_K(t_0) & N_K(t_2) - N_K(t_1) & \cdots & N_K(t_p) - N_K(t_{p-1})
\end{array} \qquad (12.6)$$

Then by the splitting property of Poisson random variables described above, we get that all
elements of the array (12.6) are independent, with the appropriate means. By definition, the $i$th
process $N_i$ is a rate $\lambda p_i$ Poisson process for each $i$, and because of the independence of the rows of
the array, the $K$ processes $N_1, \ldots, N_K$ are mutually independent.

4.36 Some orthogonal martingales based on Brownian motion 

Throughout the solution of this problem, let $0 < s < t$, and let $Y = W_t - W_s$. Note that $Y$ is
independent of $W_s$ and it has the $N(0, t - s)$ distribution.

(a) $E[M_t \mid W_s] = M_sE\left[\frac{M_t}{M_s} \mid W_s\right]$. Now $\frac{M_t}{M_s} = \exp\left(\theta Y - \frac{\theta^2(t-s)}{2}\right)$. Therefore, $E\left[\frac{M_t}{M_s} \mid W_s\right] = E\left[\frac{M_t}{M_s}\right] = 1$.
Thus $E[M_t \mid W_s] = M_s$, so by the hint, $M$ is a martingale.

(b) $W_t^2 - t = (W_s + Y)^2 - s - (t - s) = W_s^2 - s + 2W_sY + Y^2 - (t - s)$, but
$E[2W_sY \mid W_s] = 2W_sE[Y \mid W_s] = 2W_sE[Y] = 0$, and $E[Y^2 - (t - s) \mid W_s] = E[Y^2 - (t - s)] = 0$. It
follows that $E[2W_sY + Y^2 - (t - s) \mid W_s] = 0$, so the martingale property follows from the hint.

Similarly,

$$W_t^3 - 3tW_t = (Y + W_s)^3 - 3(s + t - s)(Y + W_s) = W_s^3 - 3sW_s + 3W_s^2Y + 3W_s(Y^2 - (t - s)) + Y^3 - 3tY.$$

Because $Y$ is independent of $W_s$ and because $E[Y] = E[Y^2 - (t - s)] = E[Y^3] = 0$, it follows that
$E[3W_s^2Y + 3W_s(Y^2 - (t - s)) + Y^3 - 3tY \mid W_s] = 0$, so the martingale property follows from the hint.

(c) Fix distinct nonnegative integers $m$ and $n$. Then

$$\begin{aligned}
E[M_n(s)M_m(t)] &= E[E[M_n(s)M_m(t) \mid W_s]] && \text{property of cond. expectation} \\
&= E[M_n(s)E[M_m(t) \mid W_s]] && \text{property of cond. expectation} \\
&= E[M_n(s)M_m(s)] && \text{martingale property} \\
&= 0 && \text{orthogonality of variables at a fixed time}
\end{aligned}$$



5.2 A variance estimation problem with Poisson observation 

(a)

$$P\{N = n\} = E[P(N = n \mid X)] = E\left[\frac{X^{2n}e^{-X^2}}{n!}\right] = \int_{-\infty}^{\infty} \frac{x^{2n}e^{-x^2}}{n!}\,\frac{e^{-\frac{x^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}}\,dx$$

(b) To arrive at a simple answer, we could set the derivative of $P\{N = n\}$ with respect to $\sigma^2$
equal to zero either before or after simplifying. Here we simplify first, using the fact that if $X$
is a $N(0, \tilde{\sigma}^2)$ random variable, then $E[X^{2n}] = \frac{\tilde{\sigma}^{2n}(2n)!}{n!\,2^n}$. Let $\tilde{\sigma}^2$ be such that $\frac{1}{2\tilde{\sigma}^2} = 1 + \frac{1}{2\sigma^2}$, or
equivalently, $\tilde{\sigma}^2 = \frac{\sigma^2}{1 + 2\sigma^2}$. Then the above integral can be written as follows:

$$P\{N = n\} = \frac{\tilde{\sigma}}{\sigma}\int_{-\infty}^{\infty} \frac{x^{2n}}{n!}\,\frac{e^{-\frac{x^2}{2\tilde{\sigma}^2}}}{\sqrt{2\pi\tilde{\sigma}^2}}\,dx = \frac{c_1\tilde{\sigma}^{2n+1}}{\sigma} = \frac{c_1\sigma^{2n}}{(1 + 2\sigma^2)^{\frac{2n+1}{2}}},$$

where the constant $c_1$ depends on $n$ but not on $\sigma^2$. Taking the logarithm of $P\{N = n\}$ and
calculating the derivative with respect to $\sigma^2$, we find that $P\{N = n\}$ is maximized at $\sigma^2 = n$. That is,
$\hat{\sigma}^2_{ML}(n) = n$.
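The maximization in part (b) can be confirmed with a simple grid search. The sketch below assumes the log-likelihood has the form $n \ln \sigma^2 - \frac{2n+1}{2}\ln(1 + 2\sigma^2)$ up to an additive constant, which is the logarithm of the simplified expression for $P\{N = n\}$; the grid step and upper limit are arbitrary choices.

```python
from math import log

def log_likelihood(sigma2, n):
    """log P{N = n} up to an additive constant that does not depend on sigma2."""
    return n * log(sigma2) - (2 * n + 1) / 2 * log(1 + 2 * sigma2)

def argmax_on_grid(n, step=0.01, upper=50.0):
    """Grid point in (0, upper] that maximizes the log-likelihood for observation n."""
    grid = [step * j for j in range(1, int(upper / step) + 1)]
    return max(grid, key=lambda s2: log_likelihood(s2, n))

# For n = 1, ..., 5 the maximizer should sit at (or within one grid step of) n.
maximizers = {n: argmax_on_grid(n) for n in range(1, 6)}
```

Setting the derivative $\frac{n}{\sigma^2} - \frac{2n+1}{1 + 2\sigma^2}$ to zero gives $\sigma^2 = n$ directly, so the grid search is only a sanity check.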



5.4 Transformation of estimators and estimators of transformations 

(a) Yes, because the transformation is invertible. 

(b) Yes, because the transformation is invertible. 

(c) Yes, because the transformation is linear, the pdf of $3 + 5\Theta$ is a scaled version of the pdf of $\Theta$.

(d) No, because the transformation is not linear.

(e) Yes, because the MMSE estimator is given by the conditional expectation, which is linear. That
is, $3 + 5E[\Theta \mid Y] = E[3 + 5\Theta \mid Y]$.

(f) No. Typically $E[\Theta^3 \mid Y] \ne E[\Theta \mid Y]^3$.

5.6 Finding a most likely path 

Finding the path $z$ to maximize the posterior probability given the sequence 021201 is the same
as maximizing $p_{cd}(y, z \mid \theta)$. Due to the form of the parameter $\theta = (\pi, A, B)$, for any path $z = (z_1, \ldots, z_6)$,
$p_{cd}(y, z \mid \theta)$ has the form $c^6a^i$ for some $i \ge 0$. Similarly, the variable $\delta_j(t)$ has the form
$c^ta^i$ for some $i \ge 0$. Since $a < 1$, larger values for $p_{cd}(y, z \mid \theta)$ and $\delta_j(t)$ correspond to smaller values
of $i$. Rather than keeping track of products, such as $a^ia^j$, we keep track of the exponents of the
products, which for $a^ia^j$ would be $i + j$. Thus, the problem at hand is equivalent to finding a path
from left to right in the trellis indicated in Figure 12.3(a) with minimum weight, where the weight of a
path is the sum of all the numbers indicated on the vertices and edges of the graph. Figure 12.3(b)
shows the result of running the Viterbi algorithm. The value of $\delta_j(t)$ has the form $c^ta^i$, where
$i$ is indicated by the numbers in boxes. Of the two paths reaching the final states of the trellis,
the upper one, namely the path 000000, has the smaller exponent, 18, and therefore, the larger
probability, namely $c^6a^{18}$. Therefore, 000000 is the MAP path.

Figure 12.3: Trellis diagram for finding a MAP path.

5.8 Estimation of the parameter of an exponential in additive exponential noise 

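(a) By assumption, $Z$ has the exponential distribution with parameter $\theta$, and given $Z = z$, the
conditional distribution of $Y - z$ is the exponential distribution with parameter one (for any $\theta$). So
$f_{cd}(y, z \mid \theta) = f(z \mid \theta)f(y \mid z, \theta)$ where

$$f(z \mid \theta) = \begin{cases} \theta e^{-\theta z} & z \ge 0 \\ 0 & z < 0 \end{cases} \qquad \text{and for } z \ge 0: \quad f(y \mid z, \theta) = \begin{cases} e^{-(y - z)} & y \ge z \\ 0 & y < z. \end{cases}$$

(b)

$$f(y \mid \theta) = \int_0^y f_{cd}(y, z \mid \theta)\,dz = \begin{cases} \dfrac{\theta e^{-y}\left(e^{(1-\theta)y} - 1\right)}{1 - \theta} & \theta \ne 1 \\ ye^{-y} & \theta = 1. \end{cases}$$

(c)

$$\begin{aligned}
Q(\theta \mid \theta^{(k)}) &= E[\ln f_{cd}(Y, Z \mid \theta) \mid y, \theta^{(k)}] \\
&= \ln\theta + (1 - \theta)E[Z \mid y, \theta^{(k)}] - y,
\end{aligned}$$

which is a concave function of $\theta$. The maximum over $\theta$ can be identified by setting the derivative
with respect to $\theta$ equal to zero, yielding: $\theta^{(k+1)} = \arg\max_\theta Q(\theta \mid \theta^{(k)}) = \frac{1}{E[Z \mid y, \theta^{(k)}]} = \frac{1}{\phi(y, \theta^{(k)})}$.

(d)

$$\begin{aligned}
Q(\theta \mid \theta^{(k)}) &= E[\ln f_{cd}(Y, Z \mid \theta) \mid y, \theta^{(k)}] \\
&= \sum_{t=1}^{T} E[\ln f(y_t, Z_t \mid \theta) \mid y_t, \theta^{(k)}] \\
&= T\ln\theta + (1 - \theta)\sum_{t=1}^{T}\phi(y_t, \theta^{(k)}) - \sum_{t=1}^{T} y_t,
\end{aligned}$$

which is a concave function of $\theta$. The maximum over $\theta$ can be identified by setting the derivative
with respect to $\theta$ equal to zero, yielding:

$$\theta^{(k+1)} = \arg\max_\theta Q(\theta \mid \theta^{(k)}) = \frac{T}{\sum_{t=1}^{T}\phi(y_t, \theta^{(k)})}.$$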
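The EM update of part (d) is easy to run once $\phi(y, \theta) = E[Z \mid Y = y, \theta]$ is available in closed form. The sketch below is illustrative only. It assumes the closed form $\phi(y, \theta) = \frac{y e^{cy}}{e^{cy} - 1} - \frac{1}{c}$ with $c = 1 - \theta$ (and $y/2$ when $c = 0$), which follows from the conditional density of $Z$ given $y$ being proportional to $e^{(1-\theta)z}$ on $[0, y]$; the observation values are hypothetical.

```python
from math import exp, expm1, log

def phi(y, theta):
    """E[Z | Y = y, theta]; Z given y has density proportional to exp((1-theta) z) on [0, y]."""
    c = 1.0 - theta
    if abs(c * y) < 1e-6:
        return y / 2.0  # limiting value as theta -> 1
    return y * exp(c * y) / expm1(c * y) - 1.0 / c

def log_likelihood(ys, theta):
    """Sum over observations of ln f(y | theta) from part (b)."""
    c = 1.0 - theta
    total = 0.0
    for y in ys:
        integral = y if abs(c * y) < 1e-6 else expm1(c * y) / c
        total += log(theta) - y + log(integral)
    return total

def em(ys, theta0, iterations=50):
    """Iterate theta <- T / sum_t phi(y_t, theta); returns the whole trajectory."""
    thetas = [theta0]
    for _ in range(iterations):
        thetas.append(len(ys) / sum(phi(y, thetas[-1]) for y in ys))
    return thetas

ys = [1.5, 3.0, 2.2, 4.1, 0.9]  # hypothetical observations
trajectory = em(ys, theta0=0.3)
```

A basic property of EM is that the likelihood never decreases from one iterate to the next, which gives a convenient check on the implementation of `phi`.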



5.10 Maximum likelihood estimation for HMMs 

(a) APPROACH ONE Note that $p(y \mid z) = \prod_{t=1}^{T} b_{z_t, y_t}$. Thus, for fixed $y$, $p(y \mid z)$ is maximized with
respect to $z$ by selecting $z_t$ to maximize $b_{z_t, y_t}$ for each $t$. Thus, $(\hat{Z}_{ML}(y))_t = \arg\max_i b_{i, y_t}$ for
$1 \le t \le T$.

APPROACH TWO Let $\tilde{\pi}_i = \frac{1}{N_s}$ and $\tilde{a}_{i,j} = \frac{1}{N_s}$ for all states $i, j$ of the hidden Markov process $Z$.
The HMM for parameter $\tilde{\theta} = (\tilde{\pi}, \tilde{A}, B)$ is such that all $N_s^T$ possible values for $Z$ are equally likely,
and the conditional distribution of $Y$ given $Z$ is the same as for the HMM with parameter $\theta$. Use
the Viterbi algorithm with parameter $\tilde{\theta}$ to compute $\hat{Z}_{MAP}$, and that is equal to $\hat{Z}_{ML}$ for the HMM
with parameter $\theta$.

(b) Let $\tilde{\pi}_i = 1$ if $\pi_i > 0$ and $\tilde{\pi}_i = 0$ if $\pi_i = 0$ for $1 \le i \le N_s$. Similarly, let $\tilde{a}_{i,j} = 1$ if $a_{i,j} > 0$ and
$\tilde{a}_{i,j} = 0$ if $a_{i,j} = 0$ for $1 \le i, j \le N_s$. While $\tilde{\pi}$ and the rows of $\tilde{A}$ are not normalized to sum to one,
they can still be used in the Viterbi algorithm. Under parameter $\tilde{\theta} = (\tilde{\pi}, \tilde{A}, B)$, every choice of
possible trajectory for $Z$ has weight one, every other trajectory has weight zero, and the conditional
distribution of $Y$ given $Z$ is the same as for the HMM with parameter $\theta$. Use the Viterbi algorithm
with parameter $\tilde{\theta}$ to compute $\hat{Z}_{MAP}$, and that is equal to the constrained estimator $\hat{Z}_{ML}$ for the
HMM with parameter $\theta$.

(c) Note that $P(Y = y \mid Z_1 = i) = \beta_i(1)b_{i, y_1}$, where $\beta_i(1)$ can be computed for all $i$ using the
backward algorithm. Therefore, $\hat{Z}_{1,ML}(y) = \arg\max_i \beta_i(1)b_{i, y_1}$.

(d) Note that $P(Y = y \mid Z_{t_o} = i) = \frac{\gamma_i(t_o)P\{Y = y\}}{P\{Z_{t_o} = i\}}$, where $\gamma_i(t_o)$ can be computed by the
forward-backward algorithm, and $P\{Z_{t_o} = i\} = (\pi A^{t_o - 1})_i$. Then $\hat{Z}_{t_o,ML}(y) = \arg\max_i \frac{\gamma_i(t_o)}{P\{Z_{t_o} = i\}}$.



5.12 Specialization of Baum-Welch algorithm for no hidden data

(a) Suppose the sequence $y = (y_1, \ldots, y_T)$ is observed. If $\theta^{(0)} = \theta = (\pi, A, B)$ is such that $B$ is
the identity matrix, and all entries of $\pi$ and $A$ are nonzero, then directly by the definitions (without
using the $\alpha$'s and $\beta$'s):

$$\begin{aligned}
\gamma_i(t) &= P(Z_t = i \mid Y_1 = y_1, \ldots, Y_T = y_T, \theta) = I_{\{y_t = i\}} \\
\xi_{i,j}(t) &= P(Z_t = i, Z_{t+1} = j \mid Y_1 = y_1, \ldots, Y_T = y_T, \theta) = I_{\{(y_t, y_{t+1}) = (i,j)\}}
\end{aligned}$$

Thus, (5.27) - (5.29) for the first iteration, $t = 0$, become

$$\begin{aligned}
\pi_i^{(1)} &= I_{\{y_1 = i\}} \quad \text{i.e. the probability vector for } S \text{ with all mass on } y_1 \\
a_{i,j}^{(1)} &= \frac{\text{number of } (i,j) \text{ transitions observed}}{\text{number of visits to } i \text{ up to time } T - 1} \\
b_{i,l}^{(1)} &= \frac{\text{number of times the state is } i \text{ and the observation is } l}{\text{number of times the state is } i}
\end{aligned}$$

It is assumed that $B$ is the identity matrix, so that each time the state is $i$ the observation should
also be $i$. Thus, $b_{i,l}^{(1)} = I_{\{i = l\}}$ for any state $i$ that is visited. That is consistent with the assumption
that $B$ is the identity matrix. (Alternatively, since $B$ is fixed to be the identity matrix, we could
just work with estimating $\pi$ and $A$, and simply not consider $B$ as part of the parameter to be
estimated.) The next iteration will give the same values of $\pi$ and $A$. Thus, the Baum-Welch algorithm
converges in one iteration to the final value $\theta^{(1)} = (\pi^{(1)}, A^{(1)}, B^{(1)})$ already described. Note that,
by Lemma 5.1.7, $\theta^{(1)}$ is the ML estimate.

(b) In view of part (a), the ML estimates are $\hat{\pi} = (1, 0)$ and $\hat{A} = \begin{pmatrix} \frac{2}{3} & \frac{1}{3} \\ \frac{1}{3} & \frac{2}{3} \end{pmatrix}$. This estimator of $A$
results from the fact that, of the first 21 times, the state was zero 12 times, and 8 of those 12 times
the next state was a zero. So $\hat{a}_{00} = 8/12 = 2/3$ is the ML estimate. Similarly, the ML estimate of
$a_{11}$ is 6/9, which simplifies to 2/3.
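When the state sequence is observed directly, the ML estimates reduce to counting, as in part (b). The following sketch illustrates the counting on a short made-up sequence; the sequence `y` is hypothetical and is not the one from the problem statement.

```python
from collections import Counter

def ml_transition_matrix(y, num_states):
    """Row-normalized transition counts; rows for states never left are None."""
    pair_counts = Counter(zip(y[:-1], y[1:]))
    visit_counts = Counter(y[:-1])  # visits up to time T - 1
    A = []
    for i in range(num_states):
        if visit_counts[i] == 0:
            A.append(None)
        else:
            A.append([pair_counts[(i, j)] / visit_counts[i] for j in range(num_states)])
    return A

def ml_initial_distribution(y, num_states):
    """All mass on the first observed state."""
    return [1.0 if y[0] == i else 0.0 for i in range(num_states)]

y = [0, 0, 1, 1, 0, 1, 0, 0]  # hypothetical observed state sequence
A_hat = ml_transition_matrix(y, 2)
pi_hat = ml_initial_distribution(y, 2)
```

In this example state 0 is visited four times before the final time, with two transitions to 0 and two to 1, so the first row of `A_hat` is `[0.5, 0.5]`.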

5.14 Baum-Welch saddlepoint

It turns out that $\pi^{(k)} = \pi^{(0)}$ and $A^{(k)} = A^{(0)}$, for each $k \ge 0$. Also, $B^{(k)} = B^{(1)}$ for each $k \ge 1$,
where $B^{(1)}$ is the matrix with identical rows, such that each row of $B^{(1)}$ is the empirical distribution
of the observation sequence. For example, if the observations are binary valued, and if there are
$T = 100$ observations, of which 37 observations are zero and 63 are 1, then each row of $B^{(1)}$ would
be $(0.37, 0.63)$. Thus, the EM algorithm converges in one iteration, and unless $\theta^{(0)}$ happens to
be a local maximum or local minimum, the EM algorithm converges to an inflection point of the
likelihood function.

One intuitive explanation for this assertion is that since all the rows of $B^{(0)}$ are the same, the
observation sequence is initially believed to be independent of the state sequence, and the state
process is initially believed to be stationary. Hence, even if there is, for example, notable time
variation in the observed data sequence, there is no way to change beliefs in a particular direction
in order to increase the likelihood. In real computer experiments, the algorithm may still eventually
reach a near maximum likelihood estimate, due to round-off errors in the computations which allow
the algorithm to break away from the inflection point.

The assertion can be proved by use of the update equations for the Baum-Welch algorithm. It
is enough to prove the assertion for the first iteration only, for then it follows for all iterations by
induction.

Since the rows of $B^{(0)}$ are all the same, we write $b_l$ to denote $b_{i,l}^{(0)}$ for an arbitrary value of $i$.
By induction on $t$, we find $\alpha_i(t) = b_{y_1}\cdots b_{y_t}\pi_i^{(0)}$ and $\beta_j(t) = b_{y_{t+1}}\cdots b_{y_T}$. In particular, $\beta_j(t)$ does
not depend on $j$. So the vector $(\alpha_i\beta_i : i \in S)$ is proportional to $\pi^{(0)}$, and therefore $\gamma_i(t) = \pi_i^{(0)}$.
Similarly, $\xi_{i,j}(t) = P(Z_t = i, Z_{t+1} = j \mid y, \theta^{(0)}) = \pi_i^{(0)}a_{i,j}^{(0)}$. By (5.27), $\pi^{(1)} = \pi^{(0)}$, and by (5.28),
$A^{(1)} = A^{(0)}$. Finally, (5.29) gives

$$b_{i,l}^{(1)} = \frac{\sum_{t=1}^{T}\pi_i I_{\{y_t = l\}}}{T\pi_i} = \frac{\text{number of times } l \text{ is observed}}{T}.$$

5.16 Constraining the Baum-Welch algorithm

A quite simple way to deal with this problem is to take the initial parameter $\theta^{(0)} = (\pi, A, B)$ in
the Baum-Welch algorithm to be such that $a_{ij} > 0$ if and only if $\bar{a}_{ij} = 1$ and $b_{il} > 0$ if and only if
$\bar{b}_{il} = 1$. (These constraints are added in addition to the usual constraints that $\pi$, $A$, and $B$ have
the appropriate dimensions, with $\pi$ and each row of $A$ and $B$ being probability vectors.) After all,
it makes sense for the initial parameter value to respect the constraint. And if it does, then the
same constraint will be satisfied after each iteration, and no changes are needed to the algorithm
itself.

6.2 A two station pipeline in continuous time 

(a) S = {00,01,10,11} 

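(b) [State transition diagram not reproduced.]

(c) With the states in the order 00, 01, 10, 11,

$$Q = \begin{pmatrix} -\lambda & 0 & \lambda & 0 \\ \mu_2 & -\mu_2 - \lambda & 0 & \lambda \\ 0 & \mu_1 & -\mu_1 & 0 \\ 0 & 0 & \mu_2 & -\mu_2 \end{pmatrix}$$

(d) $\eta = (\pi_{00} + \pi_{01})\lambda = (\pi_{01} + \pi_{11})\mu_2 = \pi_{10}\mu_1$. If $\lambda = \mu_1 = \mu_2 = 1.0$ then $\pi = (0.2, 0.2, 0.4, 0.2)$ and
$\eta = 0.4$.

(e) Let $\tau = \min\{t \ge 0 : X(t) = 00\}$, and define $h_s = E[\tau \mid X(0) = s]$, for $s \in S$. We wish to find
$h_{11}$. Conditioning on the first transition gives

$$\begin{aligned}
h_{00} &= 0 \\
h_{01} &= \frac{1}{\mu_2 + \lambda} + \frac{\mu_2 h_{00}}{\mu_2 + \lambda} + \frac{\lambda h_{11}}{\mu_2 + \lambda} \\
h_{10} &= \frac{1}{\mu_1} + h_{01} \\
h_{11} &= \frac{1}{\mu_2} + h_{10}
\end{aligned}$$

For $\lambda = \mu_1 = \mu_2 = 1.0$ this yields $(h_{00}, h_{01}, h_{10}, h_{11}) = (0, 3, 4, 5)$.
Thus, $h_{11} = 5$ is the required answer.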
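The numbers in parts (d) and (e) can be checked by substitution. The sketch below assumes the state order 00, 01, 10, 11 and the generator matrix structure read off from the balance equations in part (d); it uses exact fractions and the unit rates of the example.

```python
from fractions import Fraction as F

def pipeline_generator(lam, mu1, mu2):
    """Generator matrix with states ordered 00, 01, 10, 11."""
    return [[-lam, 0, lam, 0],
            [mu2, -mu2 - lam, 0, lam],
            [0, mu1, -mu1, 0],
            [0, 0, mu2, -mu2]]

lam = mu1 = mu2 = F(1)
Q = pipeline_generator(lam, mu1, mu2)
pi = [F(1, 5), F(1, 5), F(2, 5), F(1, 5)]
# Entries of pi Q; all should be exactly zero.
balance = [sum(pi[i] * Q[i][j] for i in range(4)) for j in range(4)]
throughput = (pi[0] + pi[1]) * lam

# Mean hitting times of state 00 quoted in part (e), checked by substitution
# into the first-step equations (h00 = 0).
h01, h10, h11 = F(3), F(4), F(5)
checks = [
    h01 == 1 / (mu2 + lam) + lam * h11 / (mu2 + lam),
    h10 == 1 / mu1 + h01,
    h11 == 1 / mu2 + h10,
]
```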



6.4 A simple Poisson process calculation 

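Suppose $0 \le s \le t$ and $0 \le i \le k$.

$$\begin{aligned}
P(N(s) = i \mid N(t) = k) &= \frac{P\{N(s) = i, N(t) = k\}}{P\{N(t) = k\}} \\
&= \left(\frac{e^{-\lambda s}(\lambda s)^i}{i!}\right)\left(\frac{e^{-\lambda(t-s)}(\lambda(t-s))^{k-i}}{(k-i)!}\right)\left(\frac{e^{-\lambda t}(\lambda t)^k}{k!}\right)^{-1} \\
&= \binom{k}{i}\left(\frac{s}{t}\right)^i\left(\frac{t-s}{t}\right)^{k-i}
\end{aligned}$$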



That is, given N(t) = k, the conditional distribution of N(s) is binomial. This could have been 
deduced with no calculation, using the fact that given N(t) = k, the locations of the k points are 
uniformly and independently distributed on the interval [0,t]. 



6.6 A mean hitting time problem 

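(a) $\pi Q = 0$ implies $\pi = (\frac{2}{7}, \frac{2}{7}, \frac{3}{7})$.

(b) Clearly $a_1 = 0$. Condition on the first step. The initial holding time in state $i$ has mean $-\frac{1}{q_{ii}}$ and
the next state is $j$ with probability $p^J_{ij} = \frac{-q_{ij}}{q_{ii}}$. Thus

$$\begin{pmatrix} a_0 \\ a_2 \end{pmatrix} = \begin{pmatrix} -\frac{1}{q_{00}} \\ -\frac{1}{q_{22}} \end{pmatrix} + \begin{pmatrix} 0 & p^J_{02} \\ p^J_{20} & 0 \end{pmatrix}\begin{pmatrix} a_0 \\ a_2 \end{pmatrix}.$$

Solving yields

$$\begin{pmatrix} a_0 \\ a_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 1.5 \end{pmatrix}.$$

(c) Clearly $\alpha_2(t) = 0$ for all $t$.

$$\begin{aligned}
\alpha_0(t + h) &= \alpha_0(t)(1 + q_{00}h) + \alpha_1(t)q_{10}h + o(h) \\
\alpha_1(t + h) &= \alpha_0(t)q_{01}h + \alpha_1(t)(1 + q_{11}h) + o(h)
\end{aligned}$$

Subtract $\alpha_i(t)$ from each side, divide by $h$, and let $h \to 0$ to yield

$$\left(\frac{\partial\alpha_0}{\partial t}, \frac{\partial\alpha_1}{\partial t}\right) = (\alpha_0, \alpha_1)\begin{pmatrix} q_{00} & q_{01} \\ q_{10} & q_{11} \end{pmatrix}$$

with the initial condition $(\alpha_0(0), \alpha_1(0)) = (1, 0)$. (Note: the matrix involved here is the $Q$ matrix with the
row and column for state 2 removed.)

(d) Similarly,

$$\begin{aligned}
\beta_0(t - h) &= (1 + q_{00}h)\beta_0(t) + q_{01}h\beta_1(t) + o(h) \\
\beta_1(t - h) &= q_{10}h\beta_0(t) + (1 + q_{11}h)\beta_1(t) + o(h)
\end{aligned}$$

Subtract $\beta_i(t)$'s, divide by $h$ and let $h \to 0$ to get:

$$\begin{pmatrix} -\frac{\partial\beta_0}{\partial t} \\ -\frac{\partial\beta_1}{\partial t} \end{pmatrix} = \begin{pmatrix} q_{00} & q_{01} \\ q_{10} & q_{11} \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}$$

with the boundary condition specified at the final time, that is, for $(\beta_0(t_f), \beta_1(t_f))$.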



6.8 Markov model for a link with resets 

(a) Let $S = \{0, 1, 2, 3\}$, where the state is the number of packets passed since the last reset.
[State transition diagram not reproduced: transitions at rate $\lambda$ from state $i$ to state $i + 1$ for $i = 0, 1, 2$, and resets at rate $\mu$ from each of the states 1, 2, 3 to state 0.]

(b) By the PASTA property, the dropping probability is $\pi_3$. We can find the equilibrium
distribution $\pi$ by solving the equation $\pi Q = 0$. The balance equation for state 0 is $\lambda\pi_0 = \mu(1 - \pi_0)$ so that
$\pi_0 = \frac{\mu}{\lambda + \mu}$. The balance equation for state $i \in \{1, 2\}$ is $\lambda\pi_{i-1} = (\lambda + \mu)\pi_i$, so that $\pi_1 = \pi_0\left(\frac{\lambda}{\lambda + \mu}\right)$
and $\pi_2 = \pi_0\left(\frac{\lambda}{\lambda + \mu}\right)^2$. Finally, $\lambda\pi_2 = \mu\pi_3$ so that $\pi_3 = \pi_0\left(\frac{\lambda}{\lambda + \mu}\right)^2\frac{\lambda}{\mu} = \frac{\lambda^3}{(\lambda + \mu)^3}$. The dropping
probability is $\pi_3 = \frac{\lambda^3}{(\lambda + \mu)^3}$. (This formula for $\pi_3$ can be deduced with virtually no calculation from
the properties of merged Poisson processes. Fix a time $t$. Each event is a packet arrival with
probability $\frac{\lambda}{\lambda + \mu}$ and is a reset otherwise. The types of different events are independent. Finally,
$\pi_3(t)$ is the probability that the last three events before time $t$ were arrivals. The formula follows.)
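The chain of balance equations in part (b) can be followed step by step in code. This is a minimal sketch using exact fractions; the rates `lam = 3` and `mu = 2` are assumptions chosen for illustration.

```python
from fractions import Fraction as F

def reset_link_equilibrium(lam, mu):
    """Equilibrium distribution built from the balance equations of part (b)."""
    r = lam / (lam + mu)     # probability that an event is an arrival
    pi0 = mu / (lam + mu)    # from lam * pi0 = mu * (1 - pi0)
    pi1 = pi0 * r            # from lam * pi0 = (lam + mu) * pi1
    pi2 = pi0 * r**2         # from lam * pi1 = (lam + mu) * pi2
    pi3 = lam * pi2 / mu     # from lam * pi2 = mu * pi3
    return [pi0, pi1, pi2, pi3]

lam, mu = F(3), F(2)  # hypothetical rates
pi = reset_link_equilibrium(lam, mu)
dropping_probability = pi[3]
```

The check that `pi` sums to one and that `pi[3]` equals $(\lambda/(\lambda + \mu))^3$ matches the merged Poisson process argument: the last three events before a fixed time must all be arrivals.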

6.10 A queue with decreasing service rate 

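(a) [State transition diagram not reproduced: arrival rate $\lambda$ in every state; service rate $\mu$ out of states $1, \ldots, K$ and $\mu/2$ out of states $K+1, K+2, \ldots$.]

(b) $S_2 = \sum_{k=0}^{\infty}\left(\frac{\mu}{2\lambda}\right)^k 2^{k \wedge K}$, where $k \wedge K = \min\{k, K\}$. Thus, if $\lambda < \frac{\mu}{2}$ then $S_2 = +\infty$ and the
process is recurrent. $S_1 = \sum_{k=0}^{\infty}\left(\frac{2\lambda}{\mu}\right)^k 2^{-k \wedge K}$, so if $\lambda < \frac{\mu}{2}$ then $S_1 < +\infty$ and the process is positive
recurrent. In this case, $\pi_k = \left(\frac{2\lambda}{\mu}\right)^k 2^{-k \wedge K}\pi_0$, where

$$\pi_0 = \frac{1}{S_1} = \left[\frac{1 - (\lambda/\mu)^K}{1 - (\lambda/\mu)} + \frac{(\lambda/\mu)^K}{1 - (2\lambda/\mu)}\right]^{-1}.$$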
(c) If $\lambda = \frac{2\mu}{3}$, the queue appears to be stable until it fluctuates above $K$. Eventually the
queue length will grow to infinity at rate $\lambda - \frac{\mu}{2} = \frac{\mu}{6}$. See figure above.



6.12 An M/M/1 queue with impatient customers

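(a) [State transition diagram not reproduced: arrival rate $\lambda$ in every state, and departure rates $\mu, \mu + \alpha, \mu + 2\alpha, \mu + 3\alpha, \mu + 4\alpha, \ldots$ out of states $1, 2, 3, 4, 5, \ldots$.]

(b) The process is positive recurrent for all $\lambda, \mu$ if $\alpha > 0$, and $p_k = \frac{c\lambda^k}{\mu(\mu + \alpha)\cdots(\mu + (k-1)\alpha)}$ where $c$ is
chosen so that the $p_k$'s sum to one.

(c) If $\alpha = \mu$, $p_k = \frac{c\lambda^k}{k!\,\mu^k} = \frac{c\rho^k}{k!}$. Therefore, $(p_k : k \ge 0)$ is the Poisson distribution with mean $\rho$.
Furthermore, $p_D$ is the mean departure rate by defecting customers, divided by the mean arrival
rate $\lambda$. Thus,

$$p_D = \frac{1}{\lambda}\sum_{k=1}^{\infty} p_k(k - 1)\alpha = \frac{\rho - 1 + e^{-\rho}}{\rho} \to \begin{cases} 1 & \text{as } \rho \to \infty \\ 0 & \text{as } \rho \to 0 \end{cases}$$

where l'Hôpital's rule can be used to find the limit as $\rho \to 0$.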
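The closed form for $p_D$ in part (c) can be compared against a direct evaluation of the series. The sketch below assumes $\alpha = \mu$, so that $\alpha/\lambda = 1/\rho$ and $(p_k)$ is Poisson with mean $\rho$; the truncation at 100 terms is an arbitrary choice that is ample for moderate $\rho$.

```python
from math import exp

def defection_probability_series(rho, terms=100):
    """(1/rho) * sum over k >= 1 of p_k (k - 1), with p_k the Poisson(rho) pmf."""
    total = 0.0
    p_k = exp(-rho)              # p_0
    for k in range(1, terms):
        p_k *= rho / k           # p_k obtained from p_{k-1}
        total += p_k * (k - 1)
    return total / rho

def defection_probability_closed_form(rho):
    """(rho - 1 + exp(-rho)) / rho."""
    return (rho - 1 + exp(-rho)) / rho
```

Evaluating the closed form at very large and very small `rho` illustrates the two limits stated in the solution.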



6.14 A queue with blocking 

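(a) [State transition diagram not reproduced.]

$$\pi_k = \frac{\rho^k}{1 + \rho + \rho^2 + \rho^3 + \rho^4 + \rho^5} = \frac{\rho^k(1 - \rho)}{1 - \rho^6} \quad \text{for } 0 \le k \le 5.$$

(b) $p_B = \pi_5$ by the PASTA property.

(c) $W = N_W/(\lambda(1 - p_B))$ where $N_W = \sum_{k=1}^{5}(k - 1)\pi_k$. Alternatively, $W = N/(\lambda(1 - p_B)) - \frac{1}{\mu}$
(i.e. $W$ is equal to the mean time in system minus the mean time in service).

(d)

$$\pi_0 = \frac{1}{\lambda(\text{mean cycle time for visits to state zero})} = \frac{1}{\lambda(1/\lambda + \text{mean busy period duration})}.$$

Therefore, the mean busy period duration is given by $\frac{1}{\lambda}\left[\frac{1}{\pi_0} - 1\right] = \frac{\rho - \rho^6}{\lambda(1 - \rho)} = \frac{1 - \rho^5}{\mu(1 - \rho)}$.

6.16 On two distributions seen by customers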

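(a) As can be seen in the picture (not reproduced here), between any two transitions from state $k$ to $k + 1$ there is a transition
from state $k + 1$ to $k$, and vice versa. Thus, the number of transitions of one type is within one of
the number of transitions of the other type. This establishes that $|D(k,t) - R(k,t)| \le 1$ for all $k$.

(b) Observe that

$$\begin{aligned}
\left|\frac{D(k,t)}{\alpha_t} - \frac{R(k,t)}{\delta_t}\right| &\le \left|\frac{D(k,t)}{\alpha_t} - \frac{R(k,t)}{\alpha_t}\right| + \left|\frac{R(k,t)}{\alpha_t} - \frac{R(k,t)}{\delta_t}\right| \\
&\le \frac{1}{\alpha_t} + \frac{R(k,t)}{\alpha_t}\left|1 - \frac{\alpha_t}{\delta_t}\right| \\
&\le \frac{1}{\alpha_t} + \left|1 - \frac{\alpha_t}{\delta_t}\right| \to 0 \quad \text{as } t \to \infty.
\end{aligned}$$

Thus, $\frac{D(k,t)}{\alpha_t}$ and $\frac{R(k,t)}{\delta_t}$ have the same limits, if the limits of either exist.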



6.18 Positive recurrence of reflected random walk with negative drift 

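Let $V(x) = \frac{1}{2}x^2$. Then

$$\begin{aligned}
PV(x) - V(x) &= E\left[\frac{(x + B_n + L_n)^2}{2}\right] - \frac{x^2}{2} \\
&\le E\left[\frac{(x + B_n)^2}{2}\right] - \frac{x^2}{2} \\
&= x\overline{B} + \frac{\overline{B^2}}{2}.
\end{aligned}$$

Therefore, the conditions of the combined Foster stability criteria and moment bound corollary
apply, yielding that $X$ is positive recurrent, and $\overline{X} \le \frac{\overline{B^2}}{-2\overline{B}}$. (This bound is somewhat weaker than
Kingman's moment bound, discussed later in the notes: $\overline{X} \le \frac{\mathrm{Var}(B)}{-2\overline{B}}$.)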

6.20 An inadequacy of a linear potential function 

Suppose $x$ is on the positive $x_2$ axis (i.e. $x_1 = 0$ and $x_2 > 0$). Then, given $X(t) = x$, during
the slot, queue 1 will increase to 1 with probability $a(1 - d_1) = 0.42$, and otherwise stay at zero.
Queue 2 will decrease by one with probability 0.4, and otherwise stay the same. Thus, the drift
of $V$, $E[V(X(t + 1)) - V(x) \mid X(t) = x]$, is equal to 0.02. Therefore, the drift is strictly positive for
infinitely many states, whereas the Foster-Lyapunov condition requires that the drift be negative
off of a finite set $C$. So, the linear choice for $V$ does not work for this example.

6.22 Opportunistic scheduling 

(a) The left hand side of (6.37) is the arrival rate to the set of queues in s, and the righthand side 
is the probability that some queue in s is eligible for service in a given time slot. The condition is 
necessary for the stability of the set of queues in s. 

(b) Fix $\epsilon > 0$ so that for all $s \in E$ with $s \ne \emptyset$,

$$\sum_{i \in s} (a_i + \epsilon) \le \sum_{B: B \cap s \ne \emptyset} w(B).$$

Consider the flow graph shown.

In addition to the source node $a$ and sink node $b$, there are two columns of nodes in the graph. The
first column of nodes corresponds to the $N$ queues, and the second column of nodes corresponds
to the $2^N$ subsets of $E$. There are three stages of links in the graph. The capacity of a link $(a, q_i)$
in the first stage is $a_i + \epsilon$, there is a link $(q_i, s_j)$ in the second stage if and only if $q_i \in s_j$, and each
such link has capacity greater than the sum of the capacities of all the links in the first stage, and
the capacity of a link $(s_k, b)$ in the third stage is $w(s_k)$.

We claim that the minimum of the capacities of all $a-b$ cuts is $v^* = \sum_{i=1}^N (a_i + \epsilon)$. Here is a
proof of the claim. The $a-b$ cut $(\{a\} : V - \{a\})$ (here $V$ is the set of nodes in the flow network)
has capacity $v^*$, so to prove the claim, it suffices to show that any other $a-b$ cut has capacity
greater than or equal to $v^*$. Fix any $a-b$ cut $(A : B)$. Let $\tilde{A} = A \cap \{q_1, \ldots, q_N\}$, or in words, $\tilde{A}$
is the set of nodes in the first column of the graph (i.e. set of queues) that are in $A$. If $q_i \in \tilde{A}$ and
$s_j \in B$ such that $(q_i, s_j)$ is a link in the flow graph, then the capacity of $(A : B)$ is greater than or
equal to the capacity of link $(q_i, s_j)$, which is greater than $v^*$, so the required inequality is proved
in that case. Thus, we can suppose that $A$ contains all the nodes $s_j$ in the second column such
that $s_j \cap \tilde{A} \ne \emptyset$. Therefore,

$$C(A : B) \ge \sum_{i \in \{q_1,\ldots,q_N\} - \tilde{A}} (a_i + \epsilon) + \sum_{s \subset E: s \cap \tilde{A} \ne \emptyset} w(s)$$

$$\ge \sum_{i \in \{q_1,\ldots,q_N\} - \tilde{A}} (a_i + \epsilon) + \sum_{i \in \tilde{A}} (a_i + \epsilon) = v^*, \qquad (12.7)$$

where the inequality in (12.7) follows from the choice of $\epsilon$. The claim is proved.

Therefore there is an $a-b$ flow $f$ which saturates all the links of the first stage of the flow graph.
Let $u(i, s) = f(q_i, s)/f(s, b)$ for all $i, s$ such that $f(s, b) > 0$. That is, $u(i, s)$ is the fraction of flow
on link $(s, b)$ which comes from link $(q_i, s)$. For those $s$ such that $f(s, b) = 0$, define $u(i, s)$ in some
arbitrary way, respecting the requirements $u(i, s) \ge 0$, $u(i, s) = 0$ if $i \notin s$, and $\sum_{i \in E} u(i, s) = I_{\{s \ne \emptyset\}}$.
Then $a_i + \epsilon = f(a, q_i) = \sum_s f(q_i, s) = \sum_s f(s, b)u(i, s) \le \sum_s w(s)u(i, s) = \mu_i(u)$, as required.

(c) Let $V(x) = \frac{1}{2}\sum_{i \in E} x_i^2$. Let $\delta(t)$ denote the identity of the queue given a potential service at
time $t$, with $\delta(t) = 0$ if no queue is given potential service. Then $P(\delta(t) = i \mid S(t) = s) = u(i, s)$. The
dynamics of queue $i$ are given by $X_i(t+1) = X_i(t) + A_i(t) - R_i(\delta(t)) + L_i(t)$, where $R_i(\delta) = I_{\{\delta = i\}}$.
Since $\sum_{i \in E} (A_i(t) - R_i(\delta(t)))^2 \le \sum_{i \in E} [(A_i(t))^2 + (R_i(\delta(t)))^2] \le N + \sum_{i \in E} A_i(t)^2$ we have

$$PV(x) - V(x) \le \left( \sum_{i \in E} x_i (a_i - \mu_i(u)) \right) + K \qquad (12.8)$$

$$\le -\epsilon \left( \sum_{i \in E} x_i \right) + K \qquad (12.9)$$

where $K = \frac{N}{2} + \sum_{i=1}^N K_i$. Thus, under the necessary stability conditions we have that under the
vector of scheduling probabilities $u$, the system is positive recurrent, and

$$\sum_{i \in E} \overline{X}_i \le \frac{K}{\epsilon} \qquad (12.10)$$

(d) If $u$ could be selected as a function of the state, $x$, then the right hand side of (12.8) would
be minimized by taking $u(i, s) = 1$ if $i$ is the smallest index in $s$ such that $x_i = \max_{j \in s} x_j$. This
suggests using the longest connected first (LCF) policy, in which the longest connected queue is
served in each time slot. If $P^{LCF}$ denotes the one-step transition probability matrix for the LCF
policy, then (12.8) holds for any $u$, if $P$ is replaced by $P^{LCF}$. Therefore, under the necessary
condition and $\epsilon$ as in part (b), (12.9) also holds with $P$ replaced by $P^{LCF}$, and (12.10) holds for
the LCF policy.

6.24 Stability of two queues with transfers

(a) System is positive recurrent for some $u$ if and only if $\lambda_1 < \mu_1 + \nu$, $\lambda_2 < \mu_2$, and $\lambda_1 + \lambda_2 < \mu_1 + \mu_2$.

(b)

$$QV(x) = \sum_{y: y \ne x} q_{xy}(V(y) - V(x))$$

$$= \frac{\lambda_1}{2}[(x_1 + 1)^2 - x_1^2] + \frac{\lambda_2}{2}[(x_2 + 1)^2 - x_2^2] + \frac{\mu_1}{2}[(x_1 - 1)_+^2 - x_1^2] + \frac{\mu_2}{2}[(x_2 - 1)_+^2 - x_2^2]$$

$$+ \frac{u\nu I_{\{x_1 \ge 1\}}}{2}[(x_1 - 1)^2 - x_1^2 + (x_2 + 1)^2 - x_2^2] \qquad (12.11)$$

(c) If the righthand side of (12.11) is changed by dropping the positive part symbols and dropping
the factor $I_{\{x_1 \ge 1\}}$, then it is not decreased, so that

$$QV(x) \le x_1(\lambda_1 - \mu_1 - u\nu) + x_2(\lambda_2 + u\nu - \mu_2) + K$$

$$\le -(x_1 + x_2)\min\{\mu_1 + u\nu - \lambda_1, \mu_2 - \lambda_2 - u\nu\} + K \qquad (12.12)$$

where $K = \frac{\lambda_1 + \lambda_2 + \mu_1 + \mu_2 + 2\nu}{2}$. To get the best bound on $\overline{X}_1 + \overline{X}_2$, we select $u$ to maximize the min
term in (12.12), or $u = u^*$, where $u^*$ is the point in $[0, 1]$ nearest to $\frac{\mu_2 - \lambda_2 - \mu_1 + \lambda_1}{2\nu}$. For $u = u^*$, we
find $QV(x) \le -\epsilon(x_1 + x_2) + K$ where $\epsilon = \min\{\mu_1 + \nu - \lambda_1, \mu_2 - \lambda_2, \frac{\mu_1 + \mu_2 - \lambda_1 - \lambda_2}{2}\}$. Which of the
three terms is smallest in the expression for $\epsilon$ corresponds to the three cases $u^* = 1$, $u^* = 0$, and
$0 < u^* < 1$, respectively. It is easy to check that this same $\epsilon$ is the largest constant such that the
stability conditions (with strict inequality relaxed to less than or equal) hold with $(\lambda_1, \lambda_2)$ replaced
by $(\lambda_1 + \epsilon, \lambda_2 + \epsilon)$.
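
The choice of $u^*$ and the resulting $\epsilon$ can be checked numerically. The sketch below assumes the drift terms $\mu_1 + u\nu - \lambda_1$ and $\mu_2 - \lambda_2 - u\nu$ from (12.12); the function names and the numerical rates are hypothetical, chosen to hit each of the three cases.

```python
def best_transfer_fraction(lam1, lam2, mu1, mu2, nu):
    """Maximize min{mu1 + u*nu - lam1, mu2 - lam2 - u*nu} over u in [0, 1]."""
    # The two terms are equal at this (possibly infeasible) value of u.
    u_unconstrained = (mu2 - lam2 - mu1 + lam1) / (2 * nu)
    # The first term increases in u and the second decreases, so clip to [0, 1].
    u_star = min(1.0, max(0.0, u_unconstrained))
    eps = min(mu1 + u_star * nu - lam1, mu2 - lam2 - u_star * nu)
    return u_star, eps


def eps_closed_form(lam1, lam2, mu1, mu2, nu):
    """Three-term expression for epsilon, one term per case of u*."""
    return min(mu1 + nu - lam1, mu2 - lam2, (mu1 + mu2 - lam1 - lam2) / 2)


# Hypothetical rates (lam1, lam2, mu1, mu2, nu) giving 0 < u* < 1, u* = 1, u* = 0.
cases = [
    (1.2, 0.3, 1.0, 1.0, 0.5),
    (1.4, 0.1, 1.0, 1.0, 0.5),
    (0.5, 0.9, 1.0, 1.0, 0.5),
]
results = [(best_transfer_fraction(*c), eps_closed_form(*c)) for c in cases]
for (u_star, eps), eps_cf in results:
    print(u_star, eps, eps_cf)
```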

7.2 Lack of sample path continuity of a Poisson process 

(a) The sample path of $N$ is continuous over $[0, T]$ if and only if it has no jumps in the interval,
equivalently, if and only if $N(T) = 0$. So $P(N \text{ is continuous over the interval } [0, T]) = \exp(-\lambda T)$.
Since $\{N \text{ is continuous over } [0, +\infty)\} = \cap_{n=1}^{\infty} \{N \text{ is continuous over } [0, n]\}$, it follows
that $P(N \text{ is continuous over } [0, +\infty)) = \lim_{n \to \infty} P(N \text{ is continuous over } [0, n]) = \lim_{n \to \infty} e^{-\lambda n} = 0$.

(b) Since $P(N \text{ is continuous over } [0, +\infty)) \ne 1$, $N$ is not a.s. sample continuous. However $N$ is
m.s. continuous. One proof is to simply note that the correlation function, given by $R_N(s, t) = \lambda(s \wedge t) + \lambda^2 st$,
is continuous. A more direct proof is to note that for fixed $t$, $E[|N_s - N_t|^2] = \lambda|s - t| + \lambda^2|s - t|^2 \to 0$ as $s \to t$.

7.4 Some statements related to the basic calculus of random processes 

(a) False. $\lim_{t \to \infty} \frac{1}{t}\int_0^t X_s ds = Z \ne E[Z]$ (except in the degenerate case that $Z$ has variance zero).

(b) False. One reason is that the function is continuous at zero, but not everywhere. For another,
we would have $\mathrm{Var}(X_1 - X_0 - X_2) = 3R_X(0) - 4R_X(1) + 2R_X(2) = 3 - 4 + 0 = -1$.

(c) True. In general, $R_{X'X}(\tau) = R'_X(\tau)$. Since $R_X$ is an even function, $R'_X(0) = 0$. Thus, for
any $t$, $E[X'_t X_t] = R_{X'X}(0) = R'_X(0) = 0$. Since the process $X$ has mean zero, it follows that
$\mathrm{Cov}(X'_t, X_t) = 0$ as well. Since $X$ is a Gaussian process, and differentiation is a linear operation,
$X_t$ and $X'_t$ are jointly Gaussian. Summarizing, for $t$ fixed, $X'_t$ and $X_t$ are jointly Gaussian and
uncorrelated, so they are independent. (Note: $X'_s$ is not necessarily independent of $X_t$ if $s \ne t$.)



7.6 Cross correlation between a process and its m.s. derivative 



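Fix $t, u \in T$. By assumption, $\lim_{s \to t} \frac{X_s - X_t}{s - t} = X'_t$ m.s. Therefore, by Corollary 2.2.4,
$E\left[\left(\frac{X_s - X_t}{s - t}\right) X_u\right] \to E[X'_t X_u]$ as $s \to t$. Equivalently,

$$\frac{R_X(s, u) - R_X(t, u)}{s - t} \to R_{X'X}(t, u) \quad \text{as } s \to t.$$

Hence $\partial_1 R_X(s, u)$ exists, and $\partial_1 R_X(t, u) = R_{X'X}(t, u)$.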

7.8 A windowed Poisson process 

(a) The sample paths of X are piecewise constant, integer valued with initial value zero. They 
jump by +1 at each jump of N, and jump by -1 one time unit after each jump of N. 

(b) Method 1: If $|s - t| \ge 1$ then $X_s$ and $X_t$ are increments of $N$ over disjoint intervals, and are
therefore independent, so $C_X(s, t) = 0$. If $|s - t| < 1$, then there are three disjoint intervals, $I_0$, $I_1$, and $I_2$,
with $I_0 = [s, s+1] \cap [t, t+1]$, such that $[s, s+1] = I_0 \cup I_1$ and $[t, t+1] = I_0 \cup I_2$. Thus, $X_s = D_0 + D_1$
and $X_t = D_0 + D_2$, where $D_i$ is the increment of $N$ over the interval $I_i$. The three increments
$D_0$, $D_1$, and $D_2$ are independent, and $D_0$ is a Poisson random variable with mean and variance equal
to $\lambda$ times the length of $I_0$, which is $1 - |s - t|$. Therefore, $C_X(s, t) = \mathrm{Cov}(D_0 + D_1, D_0 + D_2) = \mathrm{Cov}(D_0, D_0) = \lambda(1 - |s - t|)$. Summarizing,

$$C_X(s, t) = \begin{cases} \lambda(1 - |s - t|) & \text{if } |s - t| < 1 \\ 0 & \text{else} \end{cases}$$

Method 2: $C_X(s, t) = \mathrm{Cov}(N_{s+1} - N_s, N_{t+1} - N_t) = \lambda[\min(s+1, t+1) - \min(s+1, t) - \min(s, t+1) + \min(s, t)]$.
This answer can be simplified to the one found by Method 1 by considering the
cases $|s - t| \ge 1$, $t < s < t + 1$, and $s < t < s + 1$ separately.
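
The agreement of the two methods can be spot-checked numerically. The sketch below evaluates both expressions on a grid of nonnegative times; the last term of the Method 2 expression enters with a plus sign, as bilinearity of covariance and $\mathrm{Cov}(N_a, N_b) = \lambda \min(a, b)$ give. The function names and the rate `lam = 2.5` are illustrative assumptions.

```python
def cov_method1(s, t, lam):
    """Triangular covariance: lam * (1 - |s - t|) for |s - t| < 1, else 0."""
    return lam * max(0.0, 1.0 - abs(s - t))


def cov_method2(s, t, lam):
    """Expand Cov(N_{s+1} - N_s, N_{t+1} - N_t) using Cov(N_a, N_b) = lam * min(a, b)."""
    return lam * (min(s + 1, t + 1) - min(s + 1, t) - min(s, t + 1) + min(s, t))


lam = 2.5  # hypothetical rate
grid = [k * 0.1 for k in range(40)]
max_gap = max(abs(cov_method1(s, t, lam) - cov_method2(s, t, lam))
              for s in grid for t in grid)
print(max_gap)
```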

(c) No. $X$ has a $-1$ jump one time unit after each $+1$ jump, so the value $X_t$ for a "present" time $t$
tells less about the future, $(X_s : s \ge t)$, than the past, $(X_s : 0 \le s \le t)$, tells about the future.

(d) Yes, recall that $R_X(s, t) = C_X(s, t) + \mu_X(s)\mu_X(t)$. Since $C_X$ and $\mu_X$ are continuous functions,
so is $R_X$, so that $X$ is m.s. continuous.

(e) Yes. Using the facts that $C_X(s, t)$ is a function of $s - t$ alone, and $C_X(s) \to 0$ as $s \to \infty$, we
find as in the section on ergodicity, $\mathrm{Var}\left(\frac{1}{t}\int_0^t X_s ds\right) = \frac{2}{t}\int_0^t \left(1 - \frac{s}{t}\right) C_X(s) ds \to 0$ as $t \to \infty$.

7.10 A singular integral with a Brownian motion 

(a) The integral $\int_\epsilon^1 \frac{W_t}{t} dt$ exists in the m.s. sense for any $\epsilon > 0$ because $W_t/t$ is m.s. continuous over
$[\epsilon, 1]$. To see if the limit exists we apply the correlation form of the Cauchy criteria (Proposition
2.2.2). Using different letters as variables of integration and the fact $R_W(s, t) = s \wedge t$ (the minimum
of $s$ and $t$), yields that as $\epsilon, \epsilon' \to 0$,

$$E\left[\int_\epsilon^1 \frac{W_s}{s} ds \int_{\epsilon'}^1 \frac{W_t}{t} dt\right] = \int_\epsilon^1 \int_{\epsilon'}^1 \frac{s \wedge t}{st} \, ds \, dt \to \int_0^1 \int_0^1 \frac{s \wedge t}{st} \, ds \, dt$$

$$= 2\int_0^1 \int_0^t \frac{s \wedge t}{st} \, ds \, dt = 2\int_0^1 \int_0^t \frac{s}{st} \, ds \, dt = 2\int_0^1 \int_0^t \frac{1}{t} \, ds \, dt = 2\int_0^1 1 \, dt = 2.$$

Thus the m.s. limit defining the integral exists. The integral has the $N(0, 2)$ distribution.
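
As a sanity check on the value 2, the double integral of $(s \wedge t)/(st)$ over the unit square can be approximated with a midpoint rule. This is only a numerical sketch; the function name and the grid size are arbitrary choices.

```python
def midpoint_double_integral(n):
    """Midpoint-rule approximation of the integral of min(s,t)/(s*t) over [0,1]^2."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        s = (i + 0.5) * h
        for j in range(n):
            t = (j + 0.5) * h
            # The integrand equals 1/max(s, t); it is unbounded near the
            # origin but integrable, and midpoints avoid s = 0 or t = 0.
            total += min(s, t) / (s * t) * h * h
    return total


approx_variance = midpoint_double_integral(200)
print(approx_variance)
```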

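(b) As $a, b \to \infty$,

$$E\left[\int_1^a \frac{W_s}{s} ds \int_1^b \frac{W_t}{t} dt\right] = \int_1^a \int_1^b \frac{s \wedge t}{st} \, ds \, dt \to \int_1^\infty \int_1^\infty \frac{s \wedge t}{st} \, ds \, dt$$

$$= 2\int_1^\infty \int_1^t \frac{s \wedge t}{st} \, ds \, dt = 2\int_1^\infty \int_1^t \frac{s}{st} \, ds \, dt = 2\int_1^\infty \int_1^t \frac{1}{t} \, ds \, dt = 2\int_1^\infty \frac{t - 1}{t} \, dt = \infty,$$

so that the m.s. limit does not exist, and the integral is not well defined.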

7.12 Recognizing m.s. properties 

(a) Yes, m.s. continuous since $R_X$ is continuous. No, not m.s. differentiable since $R'_X(0)$ doesn't
exist. Yes, m.s. integrable over finite intervals since m.s. continuous. Yes, mean ergodic in m.s.
since $R_X(T) \to 0$ as $|T| \to \infty$.

(b) Yes, no, yes, for the same reasons as in part (a). Since $X$ is mean zero, $R_X(T) = C_X(T)$ for all
$T$. Thus

$$\lim_{|T| \to \infty} C_X(T) = \lim_{|T| \to \infty} R_X(T) = 1$$

Since the limit of $C_X$ exists and is not zero, $X$ is not mean ergodic in the m.s. sense.

(c) Yes, no, yes, yes, for the same reasons as in (a). 

(d) No, not m.s. continuous since $R_X$ is not continuous. No, not m.s. differentiable since $X$ is
not even m.s. continuous. Yes, m.s. integrable over finite intervals, because the Riemann integral
$\int\int R_X(s, t) \, ds \, dt$ exists and is finite, for the region of integration is a simple bounded region and
the integrand is piecewise constant.

(e) Yes, m.s. continuous since $R_X$ is continuous. No, not m.s. differentiable. For example,

$$E\left[\left|\frac{X_t - X_0}{t}\right|^2\right] = \frac{1}{t^2}\left[R_X(t, t) - R_X(t, 0) - R_X(0, t) + R_X(0, 0)\right] = \frac{1}{t^2}\left[\sqrt{t} - 0 - 0 + 0\right] \to +\infty \quad \text{as } t \to 0.$$

Yes, m.s. integrable over finite intervals since m.s. continuous.

7.14 A stationary Gaussian process 

(a) No. All mean zero stationary, Gaussian Markov processes have autocorrelation functions of the
form $R_X(t) = A\rho^{|t|}$, where $A \ge 0$ and $0 \le \rho \le 1$ for continuous time (or $|\rho| \le 1$ for discrete time).

(b) $E[X_3 | X_0] = \hat{E}[X_3 | X_0] = \frac{R_X(3)}{R_X(0)} X_0 = \frac{X_0}{10}$. The error is Gaussian with mean zero and variance
MSE $= \mathrm{Var}(X_3) - \mathrm{Var}\left(\frac{X_0}{10}\right) = 1 - 0.01 = 0.99$. So $P\{|X_3 - E[X_3 | X_0]| \ge 10\} = 2Q\left(\frac{10}{\sqrt{0.99}}\right)$.

(c) $R_{X'}(\tau) = -R''_X(\tau) = \frac{2(1 - 3\tau^2)}{(1 + \tau^2)^3}$. In particular, since $-R''_X$ exists and is continuous, $X$ is continuously
differentiable in the m.s. sense.

(d) The vector has a joint Gaussian distribution because $X$ is a Gaussian process and differentiation
is a linear operation. $\mathrm{Cov}(X_\tau, X'_0) = R_{XX'}(\tau) = -R'_X(\tau) = \frac{2\tau}{(1 + \tau^2)^2}$. In particular,
$\mathrm{Cov}(X_0, X'_0) = 0$ and $\mathrm{Cov}(X_1, X'_0) = \frac{2}{4} = 0.5$. Also, $\mathrm{Var}(X'_0) = R_{X'}(0) = 2$. So $(X_0, X'_0, X_1)^T$ has
the

$$N\left(\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0 & 0.5 \\ 0 & 2 & 0.5 \\ 0.5 & 0.5 & 1 \end{pmatrix}\right)$$

distribution.
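
The numbers in parts (b) and (d) can be recomputed directly. The sketch below assumes $R_X(\tau) = 1/(1 + \tau^2)$, which is an inference from the values quoted in this solution ($R_X(3)/R_X(0) = 0.1$ and the factors $(1 + \tau^2)^2$, $(1 + \tau^2)^3$), not something stated here; the function names are illustrative.

```python
import math


def q_function(x):
    """Standard normal complementary CDF, Q(x) = P(Z > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def r_x(tau):
    """Assumed autocorrelation function, R_X(tau) = 1 / (1 + tau^2)."""
    return 1.0 / (1.0 + tau ** 2)


# Part (b): linear estimator coefficient, mean square error, tail probability.
coefficient = r_x(3.0) / r_x(0.0)
mse = r_x(0.0) - coefficient ** 2 * r_x(0.0)
tail_probability = 2.0 * q_function(10.0 / math.sqrt(mse))

# Part (d): entries of the covariance matrix of (X_0, X'_0, X_1).
cov_x1_dx0 = 2.0 * 1.0 / (1.0 + 1.0 ** 2) ** 2   # -R_X'(tau) at tau = 1
covariance = [[1.0, 0.0, r_x(1.0)],
              [0.0, 2.0, cov_x1_dx0],
              [r_x(1.0), cov_x1_dx0, 1.0]]
print(coefficient, mse, tail_probability, covariance)
```

The tail probability is astronomically small, which reflects that an error of 10 is roughly ten standard deviations out.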



7.16 Correlation ergodicity of Gaussian processes 

Fix $h$ and let $Y_t = X_{t+h}X_t$. Clearly $Y$ is stationary with mean $\mu_Y = R_X(h)$. Observe that

$$C_Y(\tau) = E[Y_\tau Y_0] - \mu_Y^2 = E[X_{\tau+h}X_\tau X_h X_0] - R_X(h)^2$$

$$= R_X(h)^2 + R_X(\tau)R_X(\tau) + R_X(\tau + h)R_X(\tau - h) - R_X(h)^2$$

Therefore, $C_Y(\tau) \to 0$ as $|\tau| \to \infty$. Hence $Y$ is mean ergodic, so $X$ is correlation ergodic.



7.18 Gaussian review question 

(a) Since $X$ is Markovian, the best estimator of $X_2$ given $(X_0, X_1)$ is a function of $X_1$ alone.
Since $X$ is Gaussian, such estimator is linear in $X_1$. Since $X$ is mean zero, it is given by
$\mathrm{Cov}(X_2, X_1)\mathrm{Var}(X_1)^{-1}X_1 = e^{-1}X_1$. Thus $E[X_2 | X_0, X_1] = e^{-1}X_1$. No function of $(X_0, X_1)$ is
a better estimator! But $e^{-1}X_1$ is equal to $p(X_0, X_1)$ for the polynomial $p(x_0, x_1) = x_1/e$. This
is the optimal polynomial. The resulting mean square error is given by MMSE $= \mathrm{Var}(X_2) - (\mathrm{Cov}(X_1, X_2)^2)/\mathrm{Var}(X_1) = 9(1 - e^{-2})$.

(b) Given $(X_0 = \pi, X_1 = 3)$, $X_2$ is $N(3e^{-1}, 9(1 - e^{-2}))$ so

$$P(X_2 \ge 4 | X_0 = \pi, X_1 = 3) = P\left(\frac{X_2 - 3e^{-1}}{\sqrt{9(1 - e^{-2})}} \ge \frac{4 - 3e^{-1}}{\sqrt{9(1 - e^{-2})}}\right) = Q\left(\frac{4 - 3e^{-1}}{\sqrt{9(1 - e^{-2})}}\right)$$

7.20 KL expansion of a simple random process

(a) Yes, because $R_X(\tau)$ is twice continuously differentiable.

(b) No. $\lim_{t \to \infty} \frac{2}{t}\int_0^t \left(\frac{t - \tau}{t}\right) C_X(\tau) d\tau = 50 + \lim_{t \to \infty} \frac{100}{t}\int_0^t \left(\frac{t - \tau}{t}\right) \cos(20\pi\tau) d\tau = 50 \ne 0$. Thus, the
necessary and sufficient condition for mean ergodicity in the m.s. sense does not hold.

(c) APPROACH ONE Since $R_X(0) = R_X(1)$, the process $X$ is periodic with period one (actually,
with period 0.1). Thus, by the theory of WSS periodic processes, the eigenfunctions can be taken
to be $\phi_n(t) = e^{2\pi jnt}$ for $n \in \mathbb{Z}$. (Still have to identify the eigenvalues.)

APPROACH TWO The identity $\cos(\theta) = \frac{1}{2}(e^{j\theta} + e^{-j\theta})$ yields

$$R_X(s - t) = 50 + 25e^{20\pi j(s-t)} + 25e^{-20\pi j(s-t)} = 50 + 25e^{20\pi js}e^{-20\pi jt} + 25e^{-20\pi js}e^{20\pi jt}$$

$$= 50\phi_0(s)\phi_0^*(t) + 25\phi_1(s)\phi_1^*(t) + 25\phi_2(s)\phi_2^*(t)$$

for the choice $\phi_0(t) \equiv 1$, $\phi_1(t) = e^{20\pi jt}$ and $\phi_2(t) = e^{-20\pi jt}$. The eigenvalues are thus 50, 25, and 25. The other
eigenfunctions can be selected to fill out an orthonormal basis, and the other eigenvalues are zero.

APPROACH THREE For $s, t \in [0, 1]$ we have $R_X(s, t) = 50 + 50\cos(20\pi(s - t))$

$$= 50 + 50\cos(20\pi s)\cos(20\pi t) + 50\sin(20\pi s)\sin(20\pi t) = 50\phi_0(s)\phi_0^*(t) + 25\phi_1(s)\phi_1^*(t) + 25\phi_2(s)\phi_2^*(t)$$

for the choice $\phi_0(t) \equiv 1$, $\phi_1(t) = \sqrt{2}\cos(20\pi t)$ and $\phi_2(t) = \sqrt{2}\sin(20\pi t)$. The eigenvalues are thus
50, 25, and 25. The other eigenfunctions can be selected to fill out an orthonormal basis, and the
other eigenvalues are zero.

(Note: the eigenspace for eigenvalue 25 is two dimensional, so the choice of eigenfunctions spanning
that space is not unique.)
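
The eigenfunction claims of Approach Three can be checked by applying the covariance kernel to each candidate function with a Riemann sum. This is a numerical sketch under the assumption $R_X(\tau) = 50 + 50\cos(20\pi\tau)$ on $[0, 1]$; the helper names and grid size are arbitrary.

```python
import math


def r_x(tau):
    """Autocorrelation function R_X(tau) = 50 + 50 cos(20 pi tau)."""
    return 50.0 + 50.0 * math.cos(20.0 * math.pi * tau)


def apply_kernel(phi, s, n=1000):
    """Midpoint approximation of the integral of R_X(s - t) phi(t) over t in [0, 1]."""
    h = 1.0 / n
    return sum(r_x(s - (k + 0.5) * h) * phi((k + 0.5) * h) for k in range(n)) * h


def phi0(t):
    return 1.0


def phi1(t):
    return math.sqrt(2.0) * math.cos(20.0 * math.pi * t)


def phi2(t):
    return math.sqrt(2.0) * math.sin(20.0 * math.pi * t)


# Each candidate should satisfy (R phi)(s) = eigenvalue * phi(s).
candidates = [(phi0, 50.0), (phi1, 25.0), (phi2, 25.0)]
sample_points = [0.0, 0.013, 0.37, 0.9]
worst_gap = max(abs(apply_kernel(phi, s) - lam * phi(s))
                for phi, lam in candidates for s in sample_points)
print(worst_gap)
```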

7.22 KL expansion for derivative process 

(a) Since $\phi'_n(t) = (2\pi jn)\phi_n(t)$, the derivative of each $\phi_n$ is a constant times $\phi_n$ itself. Therefore, the
equation given in the problem statement leads to: $X'(t) = \sum_n \langle X, \phi_n \rangle \phi'_n(t) = \sum_n [(2\pi jn)\langle X, \phi_n \rangle]\phi_n(t)$,
which is a KL expansion, because the functions $\phi_n$ are orthonormal in $L^2[0, 1]$ and the coordinates
are orthogonal random variables. Thus,

$$\psi_n(t) = \phi_n(t), \quad \langle X', \psi_n \rangle = (2\pi jn)\langle X, \phi_n \rangle, \quad \text{and} \quad \mu_n = (2\pi n)^2\lambda_n \quad \text{for } n \in \mathbb{Z}$$

(Recall that the eigenvalues are equal to the means of the squared magnitudes of the coordinates.)

(b) Note that $\phi'_0 = 0$, $\phi'_{2k}(t) = -(2\pi k)\phi_{2k+1}(t)$ and $\phi'_{2k+1}(t) = (2\pi k)\phi_{2k}(t)$. This is similar to part
(a). The same basis functions can be used for $X'$ as for $X$, but the $(2k)$th and $(2k+1)$th coordinates
of $X'$ come from the $(2k+1)$th and $(2k)$th coordinates of $X$, respectively, for all $k \ge 1$. Specifically,
we can take

$$\psi_n(t) = \phi_n(t) \text{ for } n \ge 0, \quad \langle X', \psi_0 \rangle = \mu_0 = 0,$$

$$\langle X', \psi_{2k} \rangle = (2\pi k)\langle X, \phi_{2k+1} \rangle, \quad \mu_{2k} = (2\pi k)^2\lambda_{2k+1},$$

$$\langle X', \psi_{2k+1} \rangle = -(2\pi k)\langle X, \phi_{2k} \rangle, \quad \mu_{2k+1} = (2\pi k)^2\lambda_{2k}, \quad \text{for } k \ge 1$$

(It would have been acceptable to not define $\psi_0$, because the corresponding eigenvalue is zero.)

(c) Note that $\phi'_n(t) = \frac{(2n+1)\pi}{2}\psi_n(t)$, where $\psi_n(t) = \sqrt{2}\cos\left(\frac{(2n+1)\pi t}{2}\right)$, $n \ge 0$. That is, $\psi_n$ is
the same as $\phi_n$, but with sin replaced by cos. Or equivalently, by the hint, we discover that $\psi_n$ is
obtained from $\phi_n$ by time-reversal: $\psi_n(t) = \phi_n(1 - t)(-1)^n$. Thus, the functions $\psi_n$ are orthonormal.
As in part (a), we also have $\langle X', \psi_n \rangle = \frac{(2n+1)\pi}{2}\langle X, \phi_n \rangle$, and therefore, $\mu_n = \left(\frac{(2n+1)\pi}{2}\right)^2\lambda_n$. (The set
of eigenfunctions is not unique; for example, some could be multiplied by $-1$ to yield another valid
set.)

(d) Differentiating the KL expansion of $X$ yields

$$X'_t = \langle X, \phi_1 \rangle\phi'_1(t) + \langle X, \phi_2 \rangle\phi'_2(t) = \langle X, \phi_1 \rangle c_1\sqrt{3} - \langle X, \phi_2 \rangle c_2\sqrt{3}.$$

That is, the random process $X'$ is constant in time. So its KL expansion involves only one nonzero
term, with the eigenfunction $\psi_1(t) = 1$ for $0 \le t \le 1$. Then $\langle X', \psi_1 \rangle = \langle X, \phi_1 \rangle c_1\sqrt{3} - \langle X, \phi_2 \rangle c_2\sqrt{3}$,
and therefore $\mu_1 = 3c_1^2\lambda_1 + 3c_2^2\lambda_2$.

7.24 Mean ergodicity of a periodic WSS random process 

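$$\frac{1}{t}\int_0^t X_u \, du = \frac{1}{t}\int_0^t \sum_n \hat{X}_n e^{2\pi jnu/T} \, du = \sum_{n \in \mathbb{Z}} a_{n,t}\hat{X}_n$$

where $a_{0,t} = 1$, and for $n \ne 0$, $|a_{n,t}| = \left|\frac{1}{t}\int_0^t e^{2\pi jnu/T} \, du\right| = \left|\frac{e^{2\pi jnt/T} - 1}{2\pi jnt/T}\right| \le \frac{T}{\pi |n| t}$. The $n \ne 0$ terms are
not important as $t \to \infty$. Indeed,

$$E\left[\left|\sum_{n \in \mathbb{Z}, n \ne 0} a_{n,t}\hat{X}_n\right|^2\right] = \sum_{n \in \mathbb{Z}, n \ne 0} |a_{n,t}|^2 p_X(n) \le \frac{T^2}{\pi^2 t^2}\sum_{n \in \mathbb{Z}, n \ne 0} p_X(n) \to 0 \quad \text{as } t \to \infty$$

Therefore, $\frac{1}{t}\int_0^t X_u \, du \to \hat{X}_0$ m.s. The limit has mean zero and variance $p_X(0)$. For mean ergodicity
(in the m.s. sense), the limit should be zero with probability one, which is true if and only if
$p_X(0) = 0$. That is, the process should have no zero frequency, or DC, component. (Note: More
generally, if $X$ were not assumed to be mean zero, then $X$ would be mean ergodic if and only if
$\mathrm{Var}(\hat{X}_0) = 0$, or equivalently, $p_X(0) = \mu_X^2$, or equivalently, $\hat{X}_0$ is a constant a.s.)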

8.2 On the cross spectral density 

Follow the hint. Let $U$ be the output if $X$ is filtered by $H^\epsilon$ and $V$ be the output if $Y$ is filtered by
$H^\epsilon$. The Schwarz inequality applied to random variables $U_t$ and $V_t$ for $t$ fixed yields $|R_{UV}(0)|^2 \le R_U(0)R_V(0)$,
or equivalently,

$$\left|\int_{J^\epsilon} S_{XY}(\omega)\frac{d\omega}{2\pi}\right|^2 \le \int_{J^\epsilon} S_X(\omega)\frac{d\omega}{2\pi}\int_{J^\epsilon} S_Y(\omega)\frac{d\omega}{2\pi},$$

which implies that

$$|\epsilon S_{XY}(\omega_o) + o(\epsilon)|^2 \le (\epsilon S_X(\omega_o) + o(\epsilon))(\epsilon S_Y(\omega_o) + o(\epsilon))$$

Letting $\epsilon \to 0$ yields the desired conclusion.

8.4 Filtering a Gauss Markov process

(a) The process $Y$ is the output when $X$ is passed through the linear time-invariant system with
impulse response function $h(\tau) = e^{-\tau}I_{\{\tau \ge 0\}}$. Thus, $X$ and $Y$ are jointly WSS, and

$$R_{XY}(\tau) = R_X * \tilde{h}(\tau) = \int_{-\infty}^{\infty} R_X(t)\tilde{h}(\tau - t) \, dt = \int_{-\infty}^{\infty} R_X(t)h(t - \tau) \, dt = \begin{cases} \frac{1}{2}e^{-\tau} & \tau \ge 0 \\ \left(\frac{1}{2} - \tau\right)e^{\tau} & \tau \le 0 \end{cases}$$

(b) $X_5$ and $Y_5$ are jointly Gaussian, mean zero, with $\mathrm{Var}(X_5) = R_X(0) = 1$, and $\mathrm{Cov}(Y_5, X_5) = R_{XY}(0) = \frac{1}{2}$,
so $E[Y_5 | X_5 = 3] = (\mathrm{Cov}(Y_5, X_5)/\mathrm{Var}(X_5))3 = 3/2$.

(c) Yes, $Y$ is Gaussian, because $X$ is a Gaussian process and $Y$ is obtained from $X$ by linear operations.

(d) No, $Y$ is not Markov. For example, we see that $S_Y(\omega) = \frac{2}{(1 + \omega^2)^2}$, which does not have the
form required for a stationary mean zero Gaussian process to be Markov (namely $\frac{2A\alpha}{\alpha^2 + \omega^2}$). Another
explanation is that, if $t$ is the present time, given $Y_t$, the future of $Y$ is determined by $Y_t$ and
$(X_s : s \ge t)$. The future could be better predicted by knowing something more about $X_t$ than $Y_t$
gives alone, which is provided by knowing the past of $Y$.

(Note: the $\mathbb{R}^2$-valued process $((X_t, Y_t) : t \in \mathbb{R})$ is Markov.)



8.6 A stationary two-state Markov process 

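$\pi P = \pi$ implies $\pi = (\frac{1}{2}, \frac{1}{2})$ is the equilibrium distribution so $P\{X_n = 1\} = P\{X_n = -1\} = \frac{1}{2}$ for
all $n$. Thus $\mu_X = 0$. For $n \ge 1$

$$R_X(n) = P(X_n = 1, X_0 = 1) + P(X_n = -1, X_0 = -1) - P(X_n = -1, X_0 = 1) - P(X_n = 1, X_0 = -1)$$

$$= \frac{1}{2}\left[\frac{1}{2} + \frac{1}{2}(1 - 2p)^n\right] + \frac{1}{2}\left[\frac{1}{2} + \frac{1}{2}(1 - 2p)^n\right] - \frac{1}{2}\left[\frac{1}{2} - \frac{1}{2}(1 - 2p)^n\right] - \frac{1}{2}\left[\frac{1}{2} - \frac{1}{2}(1 - 2p)^n\right]$$

$$= (1 - 2p)^n$$

So in general, $R_X(n) = (1 - 2p)^{|n|}$. The corresponding power spectral density is given by:

$$S_X(\omega) = \sum_{n=-\infty}^{\infty} (1 - 2p)^{|n|}e^{-j\omega n} = \sum_{n=0}^{\infty} ((1 - 2p)e^{-j\omega})^n + \sum_{n=0}^{\infty} ((1 - 2p)e^{j\omega})^n - 1$$

$$= \frac{1}{1 - (1 - 2p)e^{-j\omega}} + \frac{1}{1 - (1 - 2p)e^{j\omega}} - 1$$

$$= \frac{1 - (1 - 2p)^2}{1 - 2(1 - 2p)\cos(\omega) + (1 - 2p)^2}$$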
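
The closed form for the power spectral density can be compared against a truncated version of the defining sum. In this sketch the function names, the value `p = 0.2`, and the truncation level are illustrative assumptions; the truncation is harmless because the terms decay geometrically when $0 < p < 1$.

```python
import cmath
import math


def psd_closed_form(omega, p):
    """S_X(omega) = (1 - r^2) / (1 - 2 r cos(omega) + r^2) with r = 1 - 2p."""
    r = 1.0 - 2.0 * p
    return (1.0 - r * r) / (1.0 - 2.0 * r * math.cos(omega) + r * r)


def psd_truncated_sum(omega, p, n_max=2000):
    """Truncated two-sided sum of R_X(n) e^{-j omega n} with R_X(n) = r^|n|."""
    r = 1.0 - 2.0 * p
    total = sum(r ** abs(n) * cmath.exp(-1j * omega * n)
                for n in range(-n_max, n_max + 1))
    return total.real  # imaginary parts cancel by symmetry in n


p = 0.2  # hypothetical transition probability
gaps = [abs(psd_closed_form(w, p) - psd_truncated_sum(w, p))
        for w in (0.0, 0.7, math.pi / 2, 3.0)]
print(max(gaps))
```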



8.8 A linear estimation problem 



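$$E[|X_t - Z_t|^2] = E[(X_t - Z_t)(X_t - Z_t)^*]$$

$$= R_X(0) + R_Z(0) - R_{XZ}(0) - R_{ZX}(0)$$

$$= R_X(0) + h * \tilde{h} * R_Y(0) - 2\mathrm{Re}(\tilde{h} * R_{XY}(0))$$

$$= \int_{-\infty}^{\infty} \left[S_X(\omega) + |H(\omega)|^2 S_Y(\omega) - 2\mathrm{Re}(H^*(\omega)S_{XY}(\omega))\right]\frac{d\omega}{2\pi}$$

The hint with $\sigma^2 = S_Y(\omega)$, $z_o = S_{XY}(\omega)$, and $z = H(\omega)$ implies $H_{opt}(\omega) = \frac{S_{XY}(\omega)}{S_Y(\omega)}$.

8.10 The accuracy of approximate differentiation

(a) $S_{X'}(\omega) = S_X(\omega)|H(\omega)|^2 = \omega^2 S_X(\omega)$.

(b) $k(\tau) = \frac{1}{2a}(\delta(\tau + a) - \delta(\tau - a))$ and $K(\omega) = \int_{-\infty}^{\infty} k(\tau)e^{-j\omega\tau} \, d\tau = \frac{1}{2a}(e^{j\omega a} - e^{-j\omega a}) = \frac{j\sin(a\omega)}{a}$. By
l'Hospital's rule, $\lim_{a \to 0} K(\omega) = \lim_{a \to 0} \frac{j\omega\cos(a\omega)}{1} = j\omega$.

(c) $D$ is the output of the linear system with input $X$ and transfer function $H(\omega) - K(\omega)$. The
output thus has power spectral density $S_D(\omega) = S_X(\omega)|H(\omega) - K(\omega)|^2 = S_X(\omega)\left|\omega - \frac{\sin(a\omega)}{a}\right|^2$.

(d) Or, $S_D(\omega) = S_{X'}(\omega)\left|1 - \frac{\sin(a\omega)}{a\omega}\right|^2$. Suppose $0 < a \le \frac{\sqrt{0.6}}{\omega_o}$ ($\approx \frac{0.77}{\omega_o}$). Then by the bound given
in the problem statement, if $|\omega| \le \omega_o$ then $0 \le 1 - \frac{\sin(a\omega)}{a\omega} \le \frac{(a\omega)^2}{6} \le \frac{(a\omega_o)^2}{6} \le 0.1$, so that
$S_D(\omega) \le (0.01)S_{X'}(\omega)$ for $\omega$ in the base band. Integrating this inequality over the band yields that
$E[|D_t|^2] \le (0.01)E[|X'_t|^2]$.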
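
The 0.1 bound on the relative error of the symmetric-difference approximation can be checked on a grid of frequencies. This sketch takes $a = \sqrt{0.6}/\omega_o$, the largest step allowed above; the band edge `omega_o` is a hypothetical value, since the bound depends only on the product $a\omega$.

```python
import math


def relative_error(a, omega):
    """1 - sin(a*omega)/(a*omega): relative gap between K(omega) and j*omega."""
    x = a * omega
    if x == 0.0:
        return 0.0
    return 1.0 - math.sin(x) / x


omega_o = 2.0 * math.pi * 100.0        # hypothetical band edge (rad/s)
a = math.sqrt(0.6) / omega_o           # largest step size allowed by the argument

# Scan the base band |omega| <= omega_o.
errors = [relative_error(a, omega_o * k / 1000.0) for k in range(-1000, 1001)]
worst = max(errors)
print(worst, worst ** 2)
```

Squaring the worst-case relative error gives the factor that multiplies $S_{X'}$ in the bound on $S_D$.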



8.12 Filtering Poisson white noise 

(a) Since $\mu_{N'} = \lambda$, $\mu_X = \lambda\int_{-\infty}^{\infty} h(t) \, dt$. Also, $C_X = h * \tilde{h} * C_{N'} = \lambda h * \tilde{h}$. (In particular, if
$h(t) = I_{\{0 \le t < 1\}}$, then $C_X(\tau) = \lambda(1 - |\tau|)_+$, as already found in Problem 4.17.)

(b) In the special case, in between arrival times of $N$, $X$ decreases exponentially, following the
equation $X' = -X$. At each arrival time of $N$, $X$ has an upward jump of size one. Formally, we can
write, $X' = -X + N'$. For a fixed time $t_o$, which we think of as the present time, the process after
time $t_o$ is the solution of the above differential equation, where the future of $N'$ is independent of
$X$ up to time $t_o$. Thus, the future evolution of $X$ depends only on the current value, and random
variables independent of the past. Hence, $X$ is Markov.



8.14 Linear and nonlinear reconstruction from samples 

(a) We first find the mean function and autocorrelation function of $X$. $E[X_t] = \sum_n E[g(t - n - U)]E[B_n] = 0$
because $E[B_n] = 0$ for all $n$.

$$R_X(s, t) = E\left[\sum_{n=-\infty}^{\infty} g(s - n - U)B_n \sum_{m=-\infty}^{\infty} g(t - m - U)B_m\right]$$

$$= \sigma^2\sum_{n=-\infty}^{\infty} E[g(s - n - U)g(t - n - U)] = \sigma^2\sum_{n=-\infty}^{\infty}\int_0^1 g(s - n - u)g(t - n - u) \, du$$

$$= \sigma^2\sum_{n=-\infty}^{\infty}\int_n^{n+1} g(s - v)g(t - v) \, dv = \sigma^2\int_{-\infty}^{\infty} g(s - v)g(t - v) \, dv$$

$$= \sigma^2\int_{-\infty}^{\infty} g(s - v)\tilde{g}(v - t) \, dv = \sigma^2(g * \tilde{g})(s - t)$$

So $X$ is WSS with mean zero and $R_X = \sigma^2 g * \tilde{g}$.

(b) By part (a), the power spectral density of $X$ is $\sigma^2|G(\omega)|^2$. If $g$ is a baseband signal, so that
$|G(\omega)|^2 = 0$ for $|\omega| \ge \omega_o$, then by the sampling theorem for WSS baseband random processes, $X$ can
be recovered from the samples $(X(nT) : n \in \mathbb{Z})$ as long as $T \le \frac{\pi}{\omega_o}$.

(c) For this case, $G(2\pi f) = \mathrm{sinc}^2(f)$, which is not supported in any finite interval. So part (b) does
not apply. The sample paths of $X$ are continuous and piecewise linear, and at least two sample
points fall within each linear portion of $X$. Either all pairs of samples of the form $(X_n, X_{n+0.5})$ fall
within linear regions (happens when $0.5 \le U \le 1$), or all pairs of samples of the form $(X_{n+0.5}, X_{n+1})$
fall within linear regions (happens when $0 \le U \le 0.5$). We can try reconstructing $X$ using both
cases. With probability one, only one of the cases will yield a reconstruction with change points
having spacing one. That must be the correct reconstruction of $X$. The algorithm is illustrated
in Figure 12.4. Figure 12.4(a) shows a sample path of $B$ and a corresponding sample path of $X$,
for $U = 0.75$. Thus, the breakpoints of $X$ are at times of the form $n + 0.75$ for integers $n$. Figure
12.4(b) shows the corresponding samples, taken at integer multiples of $T = 0.5$. Figure 12.4(c)
shows the result of connecting pairs of the form $(X_n, X_{n+0.5})$, and Figure 12.4(d) shows the result
of connecting pairs of the form $(X_{n+0.5}, X_{n+1})$. Of these two, only Figure 12.4(c) yields breakpoints
with unit spacing. Thus, the dashed lines in Figure 12.4(c) are connected to reconstruct $X$.

Figure 12.4: Nonlinear reconstruction of a signal from samples



8.16 An approximation of white noise 



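(a) Since $E[B_k B_l^*] = \sigma^2 I_{\{k = l\}}$,

$$E\left[\left|\int_0^1 N_t \, dt\right|^2\right] = E\left[\left|A_T T\sum_{k=1}^K B_k\right|^2\right] = (A_T T)^2 E\left[\sum_{k=1}^K B_k\sum_{l=1}^K B_l^*\right] = (A_T T)^2\sigma^2 K = A_T^2 T\sigma^2$$

(b) The choice of scaling constant $A_T$ such that $A_T^2 T \equiv 1$ is $A_T = \frac{1}{\sqrt{T}}$. Under this scaling the
process $N$ approximates white noise with power spectral density $\sigma^2$ as $T \to 0$.

(c) If the constant scaling $A_T = 1$ is used, then $E\left[\left|\int_0^1 N_t \, dt\right|^2\right] = T\sigma^2 \to 0$ as $T \to 0$.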

8.18 Filtering to maximize signal to noise ratio 

The problem is to select $H$ to minimize $\sigma^2\int_{-\infty}^{\infty} |H(\omega)|^2\frac{d\omega}{2\pi}$, subject to the constraints (i) $|H(\omega)| \le 1$
for all $\omega$, and (ii) $\int_{-\omega_o}^{\omega_o} |\omega| \, |H(\omega)|^2\frac{d\omega}{2\pi} \ge$ (the power of $X$)/2. First, $H$ should be zero outside of
the interval $[-\omega_o, \omega_o]$, for otherwise it would be contributing to the output noise power with no
contribution to the output signal power. Furthermore, if $|H(\omega)|^2 > 0$ over a small interval of length
$2\pi\epsilon$ contained in $[-\omega_o, \omega_o]$, then the contribution to the output signal power is $|\omega||H(\omega)|^2\epsilon$, whereas
the contribution to the output noise is $\sigma^2|H(\omega)|^2\epsilon$. The ratio of these contributions is $|\omega|/\sigma^2$, which
we would like to be as large as possible. Thus, the optimal $H$ has the form $H(\omega) = I_{\{a \le |\omega| \le \omega_o\}}$,
where $a$ is selected to meet the signal power constraint with equality: (power of $\hat{X}$) = (power of
$X$)/2. This yields $a = \omega_o/\sqrt{2}$. In summary, the optimal choice is $|H(\omega)|^2 = I_{\{\omega_o/\sqrt{2} \le |\omega| \le \omega_o\}}$.

8.20 Sampling a signal or process that is not band limited 

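(a) Evaluating the inverse transform of $\hat{x}^o$ at $nT$, and using the fact that $2\omega_o T = 2\pi$ yields

$$x^o(nT) = \int_{-\infty}^{\infty} e^{j\omega nT}\hat{x}^o(\omega)\frac{d\omega}{2\pi} = \int_{-\omega_o}^{\omega_o} e^{j\omega nT}\hat{x}^o(\omega)\frac{d\omega}{2\pi} = \sum_{m=-\infty}^{\infty}\int_{-\omega_o}^{\omega_o} e^{j\omega nT}\hat{x}(\omega + 2m\omega_o)\frac{d\omega}{2\pi}$$

$$= \sum_{m=-\infty}^{\infty}\int_{(2m-1)\omega_o}^{(2m+1)\omega_o} e^{j\omega nT}\hat{x}(\omega)\frac{d\omega}{2\pi} = \int_{-\infty}^{\infty} e^{j\omega nT}\hat{x}(\omega)\frac{d\omega}{2\pi} = x(nT).$$

(b) The observation follows from the sampling theorem for the narrowband signal $x^o$ and part (a).

(c) The fact $R_X(nT) = R_X^o(nT)$ follows from part (a) with $x(t) = R_X(t)$.

(d) Let $X^o$ be a WSS baseband random process with autocorrelation function $R_X^o$. Then by the
sampling theorem for baseband random processes, $X_t^o = \sum_{n=-\infty}^{\infty} X_{nT}^o \, \mathrm{sinc}\left(\frac{t - nT}{T}\right)$. But the discrete
time processes $(X_{nT} : n \in \mathbb{Z})$ and $(X_{nT}^o : n \in \mathbb{Z})$ have the same autocorrelation function by part
(c). Thus $X^o$ and $Y$ have the same autocorrelation function. Also, $Y$ has mean zero. So $Y$ is WSS
with mean zero and autocorrelation function $R_X^o$.

(e) For $0 \le \omega \le \omega_o$,

$$S_X^o(\omega) = \sum_{n=-\infty}^{\infty}\exp(-\alpha|\omega + 2n\omega_o|) = \sum_{n=0}^{\infty}\exp(-\alpha(\omega + 2n\omega_o)) + \sum_{n=-\infty}^{-1}\exp(\alpha(\omega + 2n\omega_o))$$

$$= \frac{\exp(-\alpha\omega) + \exp(\alpha(\omega - 2\omega_o))}{1 - \exp(-2\alpha\omega_o)} = \frac{\exp(-\alpha(\omega - \omega_o)) + \exp(\alpha(\omega - \omega_o))}{\exp(\alpha\omega_o) - \exp(-\alpha\omega_o)} = \frac{\cosh(\alpha(\omega_o - \omega))}{\sinh(\alpha\omega_o)}$$

Thus, for any $\omega$,

$$S_X^o(\omega) = I_{\{|\omega| \le \omega_o\}}\frac{\cosh(\alpha(\omega_o - |\omega|))}{\sinh(\alpha\omega_o)}$$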
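
The closed form in part (e) can be compared with a truncated version of the aliasing sum $\sum_n \exp(-\alpha|\omega + 2n\omega_o|)$. The function names and the values of `alpha` and `omega_o` below are illustrative assumptions; the truncation is harmless because the terms decay geometrically in $|n|$.

```python
import math


def aliased_sum(omega, alpha, omega_o, n_max=200):
    """Truncated sum over n of exp(-alpha * |omega + 2 n omega_o|)."""
    return sum(math.exp(-alpha * abs(omega + 2 * n * omega_o))
               for n in range(-n_max, n_max + 1))


def closed_form(omega, alpha, omega_o):
    """cosh(alpha (omega_o - |omega|)) / sinh(alpha omega_o), for |omega| <= omega_o."""
    return math.cosh(alpha * (omega_o - abs(omega))) / math.sinh(alpha * omega_o)


alpha, omega_o = 0.5, 2.0   # hypothetical parameter values
test_points = [-2.0, -1.3, 0.0, 0.4, 1.9]
max_gap = max(abs(aliased_sum(w, alpha, omega_o) - closed_form(w, alpha, omega_o))
              for w in test_points)
print(max_gap)
```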



8.22 Another narrowband Gaussian process 

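(a)

$$\mu_X = \mu_R\int_{-\infty}^{\infty} h(t) \, dt = \mu_R H(0) = 0$$

$$S_X(2\pi f) = |H(2\pi f)|^2 S_R(2\pi f) = 10^{-2}e^{-|f|/10^4}I_{\{5000 \le |f| \le 6000\}}$$

(b)

$$R_X(0) = \int_{-\infty}^{\infty} S_X(2\pi f) \, df = \frac{2}{10^2}\int_{5000}^{6000} e^{-f/10^4} \, df = (200)(e^{-0.5} - e^{-0.6}) = 11.54$$

$$X_{25} \sim N(0, 11.54) \text{ so } P\{X_{25} > 6\} = Q\left(\frac{6}{\sqrt{11.54}}\right) \approx Q(1.76) \approx 0.039$$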
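
The power and the tail probability in part (b) can be recomputed, with a midpoint-rule integral of the spectral density as a cross-check on the closed form. The helper names and the grid size are illustrative choices.

```python
import math


def q_function(x):
    """Standard normal complementary CDF, Q(x) = P(Z > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def s_x(f):
    """S_X(2 pi f) = 1e-2 * exp(-|f| / 1e4) on 5000 <= |f| <= 6000, else 0."""
    return 1e-2 * math.exp(-abs(f) / 1e4) if 5000.0 <= abs(f) <= 6000.0 else 0.0


# Closed form for R_X(0).
power = 200.0 * (math.exp(-0.5) - math.exp(-0.6))

# Midpoint-rule check over [5000, 6000], doubled to account for negative f.
n = 10000
h = 1000.0 / n
numeric_power = 2.0 * sum(s_x(5000.0 + (k + 0.5) * h) for k in range(n)) * h

prob = q_function(6.0 / math.sqrt(power))
print(power, numeric_power, prob)
```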

(c) For the narrowband representation about $f_c = 5500$ (see Figure 12.5),

$$S_U(2\pi f) = S_V(2\pi f) = 10^{-2}\left[e^{-(f + 5500)/10^4} + e^{-(-f + 5500)/10^4}\right]I_{\{|f| \le 500\}} = \frac{e^{-.55}}{50}\cosh(f/10^4)I_{\{|f| \le 500\}}$$

$$S_{UV}(2\pi f) = 10^{-2}\left[je^{-(f + 5500)/10^4} - je^{-(-f + 5500)/10^4}\right]I_{\{|f| \le 500\}} = \frac{-je^{-.55}}{50}\sinh(f/10^4)I_{\{|f| \le 500\}}$$

Figure 12.5: Spectra of baseband equivalent signals for $f_c = 5500$ and $f_c = 5000$.

(d) For the narrowband representation about $f_c = 5000$ (see Figure 12.5),

$$S_U(2\pi f) = S_V(2\pi f) = 10^{-2}e^{-0.5}e^{-|f|/10^4}I_{\{|f| \le 1000\}}$$

$$S_{UV}(2\pi f) = j \, \mathrm{sgn}(f)S_U(2\pi f)$$



8.24 Declaring the center frequency for a given random process 

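(a) $S_U(\omega) = g(\omega + \omega_c) + g(-\omega + \omega_c)$ and $S_{UV}(\omega) = j(g(\omega + \omega_c) - g(-\omega + \omega_c))$.

(b) The integral becomes:

$$\int_{-\infty}^{\infty} (g(\omega + \omega_c) + g(-\omega + \omega_c))^2\frac{d\omega}{2\pi} = 2\int_{-\infty}^{\infty} g(\omega)^2\frac{d\omega}{2\pi} + 2\int_{-\infty}^{\infty} g(\omega + \omega_c)g(-\omega + \omega_c)\frac{d\omega}{2\pi} = 2\|g\|^2 + g * g(2\omega_c)$$

Thus, select $\omega_c$ to maximize $g * g(2\omega_c)$.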

9.2 A smoothing problem 

Write $\hat{X}_5 = \int_0^3 g(s)Y_s \, ds + \int_7^{10} g(s)Y_s \, ds$. The mean square error is minimized over all linear estimators
if and only if $(X_5 - \hat{X}_5) \perp Y_u$ for $u \in [0, 3] \cup [7, 10]$, or equivalently

$$R_{XY}(5, u) = \int_0^3 g(s)R_Y(s, u) \, ds + \int_7^{10} g(s)R_Y(s, u) \, ds \quad \text{for } u \in [0, 3] \cup [7, 10].$$

9.4 Interpolating a Gauss Markov process

(a) The constants must be selected so that $X_0 - \hat{X}_0 \perp X_a$ and $X_0 - \hat{X}_0 \perp X_{-a}$, or equivalently
$e^{-a} - [c_1 e^{-2a} + c_2] = 0$ and $e^{-a} - [c_1 + c_2 e^{-2a}] = 0$. Solving for $c_1$ and $c_2$ (one could begin by
subtracting the two equations) yields $c_1 = c_2 = c$ where $c = \frac{e^{-a}}{1 + e^{-2a}} = \frac{1}{e^a + e^{-a}} = \frac{1}{2\cosh(a)}$. The corresponding
minimum MSE is given by

$$E[X_0^2] - E[\hat{X}_0^2] = 1 - c^2 E[(X_{-a} + X_a)^2] = 1 - c^2(2 + 2e^{-2a}) = \frac{e^{2a} - e^{-2a}}{(e^a + e^{-a})^2} = \frac{(e^a - e^{-a})(e^a + e^{-a})}{(e^a + e^{-a})^2} = \tanh(a).$$

(b) The claim is true if $(X_0 - \hat{X}_0) \perp X_u$ whenever $|u| \ge a$. If $u \ge a$ then
$E[(X_0 - c(X_{-a} + X_a))X_u] = e^{-u} - \frac{1}{e^a + e^{-a}}(e^{-a-u} + e^{a-u}) = 0$. Similarly if $u \le -a$ then
$E[(X_0 - c(X_{-a} + X_a))X_u] = e^{u} - \frac{1}{e^a + e^{-a}}(e^{a+u} + e^{-a+u}) = 0$. The orthogonality condition is thus
true whenever $|u| \ge a$ as required.
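
Both the MSE formula and the orthogonality claim can be verified numerically. The sketch below uses $R_X(\tau) = e^{-|\tau|}$, which is the autocorrelation function implied by the orthogonality equations in part (a); the function names and the value `a = 0.8` are illustrative.

```python
import math


def r_x(tau):
    """Autocorrelation of the Gauss Markov process, R_X(tau) = exp(-|tau|)."""
    return math.exp(-abs(tau))


def interpolation_coeff(a):
    """Common coefficient c = 1 / (2 cosh(a)) applied to X_{-a} and X_a."""
    return 1.0 / (2.0 * math.cosh(a))


def interpolation_mse(a):
    """MSE = 1 - c^2 E[(X_{-a} + X_a)^2] = 1 - c^2 (2 + 2 exp(-2a))."""
    c = interpolation_coeff(a)
    return 1.0 - c * c * (2.0 + 2.0 * math.exp(-2.0 * a))


def residual_correlation(a, u):
    """E[(X_0 - c (X_{-a} + X_a)) X_u]; should vanish whenever |u| >= a."""
    c = interpolation_coeff(a)
    return r_x(u) - c * (r_x(u + a) + r_x(u - a))


a = 0.8  # hypothetical half-spacing between the two observations
outside = [residual_correlation(a, u) for u in (-3.0, -0.8, 0.8, 1.5, 4.0)]
inside = residual_correlation(a, 0.0)
print(interpolation_mse(a), math.tanh(a), outside, inside)
```

The residual is correlated with $X_u$ for $|u| < a$ (for example at $u = 0$), which is consistent with observations inside the interval being informative.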

9.6 Proportional noise 

(a) In order that $\kappa Y_t$ be the optimal estimator, by the orthogonality principle, it suffices to check
two things:

1. $\kappa Y_t$ must be in the linear span of $(Y_u : a \le u \le b)$. This is true since $t \in [a, b]$ is assumed.

2. Orthogonality condition: $(X_t - \kappa Y_t) \perp Y_u$ for $u \in [a, b]$

It remains to show that $\kappa$ can be chosen so that the orthogonality condition is true. The condition is
equivalent to $E[(X_t - \kappa Y_t)Y_u^*] = 0$ for $u \in [a, b]$, or equivalently $R_{XY}(t, u) = \kappa R_Y(t, u)$ for $u \in [a, b]$.
The assumptions imply that $R_Y = R_X + R_N = (1 + \gamma^2)R_X$ and $R_{XY} = R_X$, so the orthogonality
condition becomes $R_X(t, u) = \kappa(1 + \gamma^2)R_X(t, u)$ for $u \in [a, b]$, which is true for $\kappa = 1/(1 + \gamma^2)$.
The form of the estimator is proved. The MSE is given by $E[|X_t - \hat{X}_t|^2] = E[|X_t|^2] - E[|\hat{X}_t|^2] = \frac{\gamma^2}{1 + \gamma^2}R_X(t, t)$.

(b) Since $S_Y$ is proportional to $S_X$, the factors in the spectral factorization of $S_Y$ are proportional
to the factors in the spectral factorization of $S_X$:

$$S_Y = (1 + \gamma^2)S_X = \underbrace{\left(\sqrt{1 + \gamma^2} \, S_X^+\right)}_{S_Y^+}\underbrace{\left(\sqrt{1 + \gamma^2} \, S_X^-\right)}_{S_Y^-}$$

That and the fact $S_{XY} = S_X$ imply that

$$H(\omega) = \frac{1}{S_Y^+}\left[\frac{e^{j\omega T}S_{XY}}{S_Y^-}\right]_+ = \frac{1}{\sqrt{1 + \gamma^2} \, S_X^+}\left[\frac{e^{j\omega T}S_X}{\sqrt{1 + \gamma^2} \, S_X^-}\right]_+ = \frac{\kappa}{S_X^+(\omega)}\left[e^{j\omega T}S_X^+(\omega)\right]_+$$

Therefore $H$ is simply $\kappa$ times the optimal filter for predicting $X_{t+T}$ from $(X_s : s \le t)$. In particular,
if $T < 0$ then $H(\omega) = \kappa e^{j\omega T}$, and the estimator of $X_{t+T}$ is simply $\hat{X}_{t+T|t} = \kappa Y_{t+T}$, which
agrees with part (a).

(c) As already observed, if $T > 0$ then the optimal filter is $\kappa$ times the prediction filter for $X_{t+T}$
given $(X_s : s \le t)$.

9.8 Short answer filtering questions

(a) The convolution of a causal function $h$ with itself is causal, and $H^2$ has transform $h * h$. So if
$H$ is a positive type function then $H^2$ is positive type.

(b) Since the intervals of support of $S_X$ and $S_Y$ do not intersect, $S_X(2\pi f)S_Y(2\pi f) \equiv 0$. Since
$|S_{XY}(2\pi f)|^2 \le S_X(2\pi f)S_Y(2\pi f)$ (by the first problem in Chapter 6) it follows that $S_{XY} \equiv 0$.
Hence the assertion is true.

(c) Since $\mathrm{sinc}(f)$ is the Fourier transform of $I_{[-\frac{1}{2}, \frac{1}{2}]}$, it follows that

$$[H]_+(2\pi f) = \int_0^{1/2} e^{-2\pi fjt} \, dt = \frac{1}{2}e^{-\pi jf/2} \, \mathrm{sinc}\left(\frac{f}{2}\right)$$


9.10 A singular estimation problem 

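(a) $E[X_t] = E[A]e^{j2\pi f_o t} = 0$, which does not depend on $t$.

$R_X(s,t) = E[Ae^{j2\pi f_o s}(Ae^{j2\pi f_o t})^*] = \sigma_A^2 e^{j2\pi f_o(s-t)}$ is a function of $s-t$.

Thus, $X$ is WSS with $\mu_X = 0$ and $R_X(\tau) = \sigma_A^2 e^{j2\pi f_o\tau}$. Therefore, $S_X(2\pi f) = \sigma_A^2\delta(f-f_o)$, or equivalently, $S_X(\omega) = 2\pi\sigma_A^2\delta(\omega-\omega_o)$. (This makes $R_X(\tau) = \int_{-\infty}^{\infty}S_X(2\pi f)e^{j2\pi f\tau}\,df = \int_{-\infty}^{\infty}S_X(\omega)e^{j\omega\tau}\,\frac{d\omega}{2\pi}$.)

(b) $(h * X)_t = \int_{-\infty}^{\infty}h(\tau)X_{t-\tau}\,d\tau = \int_0^{\infty}\alpha e^{-(\alpha-j2\pi f_o)\tau}Ae^{j2\pi f_o(t-\tau)}\,d\tau = \int_0^{\infty}\alpha e^{-\alpha\tau}\,d\tau\,Ae^{j2\pi f_o t} = X_t$.
Another way to see this is to note that $X$ is a pure tone sinusoid at frequency $f_o$, and $H(2\pi f_o) = 1$.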

(c) In view of part (b), the mean square error is the power of the output due to the noise, or

$$\mathrm{MSE} = (h * \tilde{h} * R_N)(0) = \int_{-\infty}^{\infty}(h * \tilde{h})(t)R_N(0-t)\,dt = \sigma_N^2\,(h * \tilde{h})(0) = \sigma_N^2\|h\|^2 = \sigma_N^2\int_0^{\infty}\alpha^2 e^{-2\alpha t}\,dt = \frac{\sigma_N^2\alpha}{2},$$

where $\tilde{h}(t) = h^*(-t)$. The MSE can be made arbitrarily small by taking $\alpha$ small enough. That is, the minimum mean square error for estimation of $X_t$ from $(Y_s : s \le t)$ is zero. Intuitively, the power of the signal $X$ is concentrated at a single frequency, while the noise power in a small interval around that frequency is small, so that perfect estimation is possible.
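The two facts used in parts (b) and (c), namely $H(2\pi f_o) = 1$ and $\|h\|^2 = \alpha/2$, can be checked numerically. This is a minimal sketch assuming the filter $h(t) = \alpha e^{-(\alpha - j2\pi f_o)t}$ for $t \ge 0$; the parameter values and the helper `trapezoid` are illustrative.

```python
import numpy as np

alpha, f_o = 0.4, 3.0                          # illustrative parameter values
t = np.linspace(0.0, 60.0, 600001)             # exp(-alpha * 60) is negligible
dt = t[1] - t[0]
h = alpha * np.exp(-(alpha - 2j * np.pi * f_o) * t)

def trapezoid(y):
    # Trapezoidal rule on the uniform grid t
    return dt * (y.sum() - 0.5 * (y[0] + y[-1]))

H_at_fo = trapezoid(h * np.exp(-2j * np.pi * f_o * t))   # transfer function at f_o
energy = trapezoid(np.abs(h) ** 2)                       # ||h||^2
```

Multiplying `energy` by $\sigma_N^2$ gives the MSE of part (c), which shrinks linearly in $\alpha$.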

9.12 A prediction problem 

The optimal prediction filter is given by $\frac{1}{S_X^+}\left[e^{j\omega T}S_X^+\right]_+$. Since $R_X(\tau) = e^{-|\tau|}$, the spectral factorization of $S_X$ is given by

$$S_X(\omega) = \underbrace{\left(\frac{\sqrt{2}}{j\omega+1}\right)}_{S_X^+}\underbrace{\left(\frac{\sqrt{2}}{-j\omega+1}\right)}_{S_X^-}$$

so $[e^{j\omega T}S_X^+]_+ = e^{-T}S_X^+$ (see Figure 12.6). Thus the optimal prediction filter is $H(\omega) \equiv e^{-T}$, or in the time domain it is $h(t) = e^{-T}\delta(t)$, so that $\hat{X}_{T+t|t} = e^{-T}X_t$. This simple form can be explained and derived another way. Since linear estimation is being considered, only the means (assumed zero) and correlation functions of the processes matter. We can therefore assume without loss of generality that $X$ is a real valued Gaussian process. By the form of $R_X$ we recognize that $X$ is Markov so the best estimate of $X_{T+t}$ given $(X_s : s \le t)$ is a function of $X_t$ alone. Since $X$ is Gaussian with mean zero, the optimal estimator of $X_{t+T}$ given $X_t$ is $E[X_{t+T} \mid X_t] = \frac{\mathrm{Cov}(X_{t+T},X_t)X_t}{\mathrm{Var}(X_t)} = e^{-T}X_t$.

Figure 12.6: $\sqrt{2}\,e^{j\omega T}S_X^+$ in the time domain

9.14 Spectral decomposition and factorization

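(a) Building up transform pairs by steps yields:

$$\begin{aligned}
\mathrm{sinc}(f) &\longleftrightarrow I_{\{-\frac{1}{2} \le t \le \frac{1}{2}\}}\\
\mathrm{sinc}(100f) &\longleftrightarrow 10^{-2}I_{\{-\frac{1}{2} \le \frac{t}{100} \le \frac{1}{2}\}}\\
\mathrm{sinc}(100f)e^{2\pi jfT} &\longleftrightarrow 10^{-2}I_{\{-\frac{1}{2} \le \frac{t+T}{100} \le \frac{1}{2}\}}\\
\left[\mathrm{sinc}(100f)e^{j2\pi fT}\right]_+ &\longleftrightarrow 10^{-2}I_{\{-50-T \le t \le 50-T\}\cap\{t \ge 0\}}
\end{aligned}$$

so

$$\|x\|^2 = 10^{-4}\cdot\text{length of }([-50-T,\,50-T]\cap[0,+\infty)) = \begin{cases}10^{-2} & T \le -50\\ 10^{-4}(50-T) & -50 \le T \le 50\\ 0 & T \ge 50.\end{cases}$$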
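The piecewise answer in part (a) is just $10^{-4}$ times the length of an interval intersection, which the following short sketch computes directly. The function name `energy` is illustrative.

```python
def energy(T):
    """Return 1e-4 * length([-50 - T, 50 - T] intersected with [0, +inf))."""
    lo = max(-50.0 - T, 0.0)      # left end after clipping at zero
    hi = 50.0 - T                 # right end of the shifted rectangle
    return 1e-4 * max(hi - lo, 0.0)
```

For example, `energy(-60)` falls in the first case and gives $10^{-2}$, `energy(0)` gives $10^{-4}\cdot 50$, and `energy(70)` gives zero.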



(b) By the hint, $1+3j$ is a pole of $S$. (Without the hint, the poles can be found by first solving for values of $\omega^2$ for which the denominator of $S$ is zero.) Since $S$ is real valued, $1-3j$ must also be a pole of $S$. Since $S$ is an even function, i.e. $S(\omega) = S(-\omega)$, $-(1+3j)$ and $-(1-3j)$ must also be poles. Indeed, we find

$$S(\omega) = \frac{1}{(\omega-(1+3j))(\omega-(1-3j))(\omega+1+3j)(\omega+1-3j)},$$

or, multiplying each term by $j$ (and using $j^4 = 1$) and rearranging terms:

$$S(\omega) = \underbrace{\frac{1}{(j\omega+3+j)(j\omega+3-j)}}_{S^+(\omega)}\;\underbrace{\frac{1}{(-j\omega+3+j)(-j\omega+3-j)}}_{S^-(\omega)},$$

or $S^+(\omega) = \frac{1}{(j\omega)^2+6j\omega+10}$. The choice of $S^+$ is unique up to a multiplication by a unit magnitude constant.
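The factorization can be verified numerically on a grid of real frequencies. In the sketch below, the quartic $\omega^4 + 16\omega^2 + 100$ is obtained by expanding the product of the four pole factors listed above; it is computed here rather than quoted from the problem statement, and the function names are illustrative.

```python
import numpy as np

def s_plus(w):
    # Candidate causal factor 1 / ((jw)^2 + 6jw + 10)
    return 1.0 / ((1j * w) ** 2 + 6j * w + 10.0)

def s_from_poles(w):
    # 1 over the product of (w - pole) for the four poles named in the solution
    poles = [1 + 3j, 1 - 3j, -1 - 3j, -1 + 3j]
    return 1.0 / np.prod([w - p for p in poles], axis=0)

w = np.linspace(-20.0, 20.0, 401)
s_fact = s_plus(w) * np.conj(s_plus(w))       # S^-(w) = conj(S^+(w)) for real w
s_poly = 1.0 / (w ** 4 + 16.0 * w ** 2 + 100.0)
```

All three expressions agree on the grid, and `s_fact` is real and positive, as a power spectral density should be.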



9.16 Estimation of a random signal, using the KL expansion 

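(a) Note that $\langle Y,\varphi_j\rangle = \langle X,\varphi_j\rangle + \langle N,\varphi_j\rangle$ for all $j$, where the variables $\langle X,\varphi_j\rangle, j \ge 1$ and $\langle N,\varphi_j\rangle, j \ge 1$ are all mutually orthogonal, with $E[|\langle X,\varphi_j\rangle|^2] = \lambda_j$ and $E[|\langle N,\varphi_j\rangle|^2] = \sigma^2$. Observation of the process $Y$ is linearly equivalent to observation of $(\langle Y,\varphi_j\rangle : j \ge 1)$. Since these random variables are orthogonal and all random variables are mean zero, the MMSE estimator is the sum of the projections onto the individual observations, $\langle Y,\varphi_j\rangle$. But for fixed $i$, only the $i^{th}$ observation, $\langle Y,\varphi_i\rangle = \langle X,\varphi_i\rangle + \langle N,\varphi_i\rangle$, is not orthogonal to $\langle X,\varphi_i\rangle$. Thus, the optimal linear estimator of $\langle X,\varphi_i\rangle$ given $Y$ is $\frac{\mathrm{Cov}(\langle X,\varphi_i\rangle,\langle Y,\varphi_i\rangle)}{\mathrm{Var}(\langle Y,\varphi_i\rangle)}\langle Y,\varphi_i\rangle = \frac{\lambda_i\langle Y,\varphi_i\rangle}{\lambda_i+\sigma^2}$. The mean square error is (using the orthogonality principle): $E[|\langle X,\varphi_i\rangle|^2] - E\left[\left|\frac{\lambda_i\langle Y,\varphi_i\rangle}{\lambda_i+\sigma^2}\right|^2\right] = \lambda_i - \frac{\lambda_i^2}{\lambda_i+\sigma^2} = \frac{\lambda_i\sigma^2}{\lambda_i+\sigma^2}$.

(b) Since $f(t) = \sum_j\langle f,\varphi_j\rangle\varphi_j(t)$, we have $\langle X,f\rangle = \sum_j\langle f,\varphi_j\rangle\langle X,\varphi_j\rangle$. That is, the random variable to be estimated is the sum of the random variables of the form treated in part (a). Thus, the best linear estimator of $\langle X,f\rangle$ given $Y$ can be written as the corresponding weighted sum of linear estimators:

$$(\text{MMSE estimator of }\langle X,f\rangle\text{ given }Y) = \sum_i\frac{\lambda_i\langle Y,\varphi_i\rangle\langle f,\varphi_i\rangle}{\lambda_i+\sigma^2}.$$

The error in estimating $\langle X,f\rangle$ is the sum of the errors for estimating the terms $\langle f,\varphi_j\rangle\langle X,\varphi_j\rangle$, and those errors are orthogonal. Thus, the mean square error for $\langle X,f\rangle$ is the sum of the mean square errors of the individual terms:

$$(\text{MSE}) = \sum_i\frac{\lambda_i\sigma^2|\langle f,\varphi_i\rangle|^2}{\lambda_i+\sigma^2}.$$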
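The per-coordinate gains and the resulting mean square error are simple to compute once the eigenvalues and the coordinates of $f$ are known. The following sketch uses made-up eigenvalues and a made-up noise level purely for illustration; the function names are not from the notes.

```python
def kl_estimator_weights(eigenvalues, sigma2):
    """Per-coordinate gains lambda_i / (lambda_i + sigma^2)."""
    return [lam / (lam + sigma2) for lam in eigenvalues]

def kl_mse(eigenvalues, sigma2, f_coeffs):
    """MSE for estimating <X, f>, where f_coeffs[i] = <f, phi_i>."""
    return sum(lam * sigma2 * abs(c) ** 2 / (lam + sigma2)
               for lam, c in zip(eigenvalues, f_coeffs))

lams = [4.0, 1.0, 0.25]            # hypothetical KL eigenvalues
sigma2 = 1.0                       # hypothetical noise level
gains = kl_estimator_weights(lams, sigma2)
mse_first = kl_mse(lams, sigma2, [1.0, 0.0, 0.0])   # f = phi_1
```

Coordinates with $\lambda_i \gg \sigma^2$ get a gain near one, while coordinates buried in noise are nearly discarded.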



9.18 Linear innovations and spectral factorization 

First approach: The first approach is motivated by the fact that $\frac{1}{S_Y^+}$ is a whitening filter. Let $\mathcal{H}(z) = \frac{\beta}{S_X^+(z)}$, and let $Y$ be the output when $X$ is passed through a linear time-invariant system with z-transform $\mathcal{H}(z)$. We prove that $Y$ is the innovations process for $X$. Since $\mathcal{H}$ is positive type and $\lim_{|z|\to\infty}\mathcal{H}(z) = 1$, it follows that $Y_k = X_k + h(1)X_{k-1} + h(2)X_{k-2} + \cdots$. Since $S_Y(z) = \mathcal{H}(z)\mathcal{H}^*(1/z^*)S_X(z) \equiv \beta^2$, it follows that $R_Y(k) = \beta^2 I_{\{k=0\}}$. In particular,

$$Y_k \perp \text{linear span of }\{Y_{k-1}, Y_{k-2}, \cdots\}.$$

Since $\mathcal{H}$ and $1/\mathcal{H}$ both correspond to causal filters, the linear span of $\{Y_{k-1}, Y_{k-2}, \cdots\}$ is the same as the linear span of $\{X_{k-1}, X_{k-2}, \cdots\}$. Thus, the above orthogonality condition becomes,

$$X_k - (-h(1)X_{k-1} - h(2)X_{k-2} - \cdots) \perp \text{linear span of }\{X_{k-1}, X_{k-2}, \cdots\}.$$

Therefore $-h(1)X_{k-1} - h(2)X_{k-2} - \cdots$ must equal $\hat{X}_{k|k-1}$, the one step predictor for $X_k$. Thus, $(Y_k)$ is the innovations sequence for $(X_k)$. The one step prediction error is $E[|Y_k|^2] = R_Y(0) = \beta^2$.
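To make the whitening step concrete, here is a small sketch for one hypothetical example, not taken from the notes: a moving-average process with $S_X^+(z) = \beta(1 + a/z)$, $|a| < 1$. The whitening filter $\beta/S_X^+(z)$ then has impulse response $h(n) = (-a)^n$, and the cascade of $S_X^+$ followed by the whitening filter should collapse to $\beta$ times a unit pulse.

```python
import numpy as np

beta, a, n_taps = 2.0, 0.6, 200        # hypothetical values with |a| < 1
s_plus = beta * np.array([1.0, a])     # impulse response of S_X^+(z) = beta (1 + a/z)
h = (-a) ** np.arange(n_taps)          # impulse response of H(z) = beta / S_X^+(z)

# Cascade of S_X^+ (shaping unit-variance white noise into X) and the filter H
cascade = np.convolve(s_plus, h)[:n_taps]

# Autocorrelation of the output Y at lags 0..4, from the cascade impulse response
r_y = [float(np.dot(cascade[: n_taps - k], cascade[k:])) for k in range(5)]
```

The leading tap `h[0]` equals one, matching $\lim_{|z|\to\infty}\mathcal{H}(z) = 1$, and `r_y` is $\beta^2$ at lag zero and zero elsewhere.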

Second approach: The filter $\mathcal{K}$ for the optimal one-step linear predictor $(\hat{X}_{k+1|k})$ is given by (take $T = 1$ in the general formula):

$$\mathcal{K} = \frac{1}{S_X^+}\left[zS_X^+\right]_+.$$

The z-transform $zS_X^+$ corresponds to a function in the time domain with value $\beta$ at time $-1$, and value zero at all other negative times, so $[zS_X^+]_+ = zS_X^+ - z\beta$. Hence $\mathcal{K}(z) = z - \frac{z\beta}{S_X^+(z)}$. If $X$ is filtered using $\mathcal{K}$, the output at time $k$ is $\hat{X}_{k+1|k}$. So if $X$ is filtered using $1 - \frac{\beta}{S_X^+(z)}$, the output at time $k$ is $\hat{X}_{k|k-1}$. So if $X$ is filtered using $\mathcal{H}(z) = 1 - \left(1 - \frac{\beta}{S_X^+(z)}\right) = \frac{\beta}{S_X^+(z)}$ then the output at time $k$ is $X_k - \hat{X}_{k|k-1} = \tilde{X}_k$, the innovations sequence. The output $\tilde{X}$ has $S_{\tilde{X}}(z) \equiv \beta^2$, so the prediction error is $R_{\tilde{X}}(0) = \beta^2$.



9.20 A discrete-time Wiener filtering problem 

To begin,

$$\frac{z^T S_{XY}(z)}{S_Y^-(z)} = \frac{z^T}{\beta(1-\rho/z)(1-z_o\rho)} + \frac{z^{T+1}}{\beta(\frac{1}{z_o}-\rho)(1-z_o z)}.$$

The right hand side corresponds in the time domain to the sum of an exponential function supported on $-T, -T+1, -T+2, \ldots$ and an exponential function supported on $-T-1, -T-2, \ldots$. If $T \ge 0$ then only the first term contributes to the positive part, yielding



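$$\left[\frac{z^T S_{XY}}{S_Y^-}\right]_+ = \frac{\rho^T}{\beta(1-\rho/z)(1-z_o\rho)}, \qquad \mathcal{H}(z) = \frac{\rho^T}{\beta^2(1-z_o\rho)(1-z_o/z)}, \qquad\text{and}\qquad h(n) = \frac{\rho^T}{\beta^2(1-z_o\rho)}\,z_o^n I_{\{n\ge 0\}}.$$

On the other hand if $T \le 0$ then

$$\left[\frac{z^T S_{XY}}{S_Y^-}\right]_+ = \frac{z^T}{\beta(1-\rho/z)(1-z_o\rho)} + \frac{z(z^T - z_o^{-T})}{\beta(\frac{1}{z_o}-\rho)(1-z_o z)}$$

so

$$\mathcal{H}(z) = \frac{z^T}{\beta^2(1-z_o\rho)(1-z_o/z)} + \frac{z(z^T - z_o^{-T})(1-\rho/z)}{\beta^2(\frac{1}{z_o}-\rho)(1-z_o z)(1-z_o/z)}.$$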



Inverting the z-transforms and arranging terms yields that the impulse response function for the optimal filter is given by

$$h(n) = \frac{1}{\beta^2(1-z_o^2)}\left\{z_o^{|n+T|} - \left(\frac{z_o-\rho}{\frac{1}{z_o}-\rho}\right)z_o^{n-T}\right\}I_{\{n\ge 0\}}. \qquad (12.13)$$

Graphically, $h$ is the sum of a two-sided symmetric exponential function, slid to the right by $-T$ and set to zero for negative times, minus a one sided exponential function on the nonnegative integers. (This structure can be deduced by considering that the optimal causal estimator of $X_{t+T}$ is the optimal causal estimator of the optimal noncausal estimator of $X_{t+T}$.) Going back to the z-transform domain, we find that $\mathcal{H}$ can be written as

$$\mathcal{H}(z) = \left[\frac{z^T}{\beta^2(1-z_o/z)(1-z_o z)}\right]_+ - \frac{z_o^{-T}(z_o-\rho)}{\beta^2(1-z_o^2)(\frac{1}{z_o}-\rho)(1-z_o/z)}. \qquad (12.14)$$

Although it is helpful to think of the cases $T \ge 0$ and $T \le 0$ separately, interestingly enough, the
expressions (12.13) and (12.14) for the optimal $h$ and $H$ hold for any integer value of $T$.
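The closed-form impulse responses can be compared against a direct numerical solution of the Wiener-Hopf equations on a long finite window. The sketch below rests on an assumption not restated in this solution: that $S_X(z) = \frac{1}{(1-\rho/z)(1-\rho z)}$ and $Y = X + W$ with $W$ white noise of variance $\sigma^2$, so that $\beta^2 z_o = \sigma^2\rho$ and $z_o + 1/z_o = \frac{1+\sigma^2(1+\rho^2)}{\sigma^2\rho}$. The closed form is evaluated case by case: the $\rho^T$ expression for $T \ge 0$ and the two-term expression for $T < 0$. All names and parameter values are illustrative.

```python
import numpy as np

rho, sigma2 = 0.5, 1.0                       # hypothetical model parameters
b = (1.0 + sigma2 * (1.0 + rho ** 2)) / (sigma2 * rho)
z_o = (b - np.sqrt(b * b - 4.0)) / 2.0       # root of z + 1/z = b inside the unit circle
beta2 = sigma2 * rho / z_o                   # beta^2

def r_x(k):
    # Autocorrelation of X under the assumed model
    return rho ** abs(k) / (1.0 - rho ** 2)

def h_closed_form(n, T):
    if T >= 0:
        return rho ** T * z_o ** n / (beta2 * (1.0 - z_o * rho))
    gain = (z_o - rho) / (1.0 / z_o - rho)
    return (z_o ** abs(n + T) - gain * z_o ** (n - T)) / (beta2 * (1.0 - z_o ** 2))

def h_wiener_hopf(T, N=300):
    # Solve sum_m h(m) R_Y(n - m) = R_X(n + T) for 0 <= n, m < N (finite window)
    idx = np.arange(N)
    lag = np.abs(idx[:, None] - idx[None, :])
    R_Y = rho ** lag / (1.0 - rho ** 2) + sigma2 * np.eye(N)
    rhs = np.array([r_x(n + T) for n in idx])
    return np.linalg.solve(R_Y, rhs)

errors = {T: max(abs(h_wiener_hopf(T)[n] - h_closed_form(n, T)) for n in range(20))
          for T in (-3, -1, 0, 2)}
```

Because the window is long compared with the decay rate $z_o$, the first few taps of the finite-window solution are numerically indistinguishable from the infinite-horizon filter.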

9.22 Estimation given a strongly correlated process 

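(a) $R_X = g * \tilde{g} \longleftrightarrow S_X(z) = \mathcal{G}(z)\mathcal{G}^*(1/z^*)$,
$R_Y = k * \tilde{k} \longleftrightarrow S_Y(z) = \mathcal{K}(z)\mathcal{K}^*(1/z^*)$,
$R_{XY} = g * \tilde{k} \longleftrightarrow S_{XY}(z) = \mathcal{G}(z)\mathcal{K}^*(1/z^*)$.

(b) Note that $S_Y^+(z) = \mathcal{K}(z)$ and $S_Y^-(z) = \mathcal{K}^*(1/z^*)$. By the formula for the optimal causal estimator,

$$\mathcal{H}(z) = \frac{1}{S_Y^+}\left[\frac{S_{XY}}{S_Y^-}\right]_+ = \frac{1}{\mathcal{K}(z)}\left[\frac{\mathcal{G}(z)\mathcal{K}^*(1/z^*)}{\mathcal{K}^*(1/z^*)}\right]_+ = \frac{[\mathcal{G}]_+}{\mathcal{K}} = \frac{\mathcal{G}(z)}{\mathcal{K}(z)}.$$

(c) The power spectral density of the estimator process $\hat{X}$ is given by $\mathcal{H}(z)\mathcal{H}^*(1/z^*)S_Y(z) = S_X(z)$. Therefore, $\mathrm{MSE} = R_X(0) - R_{\hat{X}}(0) = \int_{-\pi}^{\pi}S_X(e^{j\omega})\frac{d\omega}{2\pi} - \int_{-\pi}^{\pi}S_{\hat{X}}(e^{j\omega})\frac{d\omega}{2\pi} = 0$. A simple explanation for why the MSE is zero is the following. Using $\frac{1}{\mathcal{K}}$ inverts $\mathcal{K}$, so that filtering $Y$ with $\frac{1}{\mathcal{K}}$ produces the process $W$. Filtering that with $\mathcal{G}$ then yields $X$. That is, filtering $Y$ with $\mathcal{H}$ produces $X$, so the estimation error is zero.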
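The explanation in part (c) can be illustrated with a short simulation. This is a sketch under assumptions: the taps of `g` and `k` are made up, `k` is chosen causal and minimum phase so that $1/\mathcal{K}$ is a stable causal filter, and $W$ is taken to be zero before time zero.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(500)             # white driving process W

g = np.array([1.0, -0.3, 0.2])           # hypothetical causal g
k = np.array([1.0, 0.5])                 # hypothetical causal, minimum phase k

x = np.convolve(w, g)[: len(w)]          # X = g * W
y = np.convolve(w, k)[: len(w)]          # Y = k * W

# Apply 1/K to Y by the recursion w_hat[n] = (y[n] - k[1] w_hat[n-1]) / k[0]
w_hat = np.zeros_like(y)
for n in range(len(y)):
    prev = w_hat[n - 1] if n > 0 else 0.0
    w_hat[n] = (y[n] - k[1] * prev) / k[0]

x_hat = np.convolve(w_hat, g)[: len(w)]  # then apply G
max_error = float(np.max(np.abs(x_hat - x)))
```

Up to floating point rounding, `x_hat` reproduces `x` exactly, which is the zero-MSE conclusion.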

10.2 A covering problem 

(a) Let $X_i$ denote the location of the $i^{th}$ base station. Then $F = f(X_1,\ldots,X_m)$, where $f$ satisfies the Lipschitz condition with constant $(2r-1)$. Thus, by the method of bounded differences based on the Azuma-Hoeffding inequality, $P\{|F - E[F]| \ge \gamma\} \le 2\exp\left(-\frac{\gamma^2}{m(2r-1)^2}\right)$.

(b) Using the Poisson method and associated bound technique, we compare to the case that the number of stations has a Poisson distribution with mean $m$. Note that the mean number of stations that cover cell $i$ is $\frac{m(2r-1)}{n}$, unless cell $i$ is near one of the boundaries. If cells 1 and $n$ are covered, then all the other cells within distance $r$ of either boundary are covered. Thus,

$$\begin{aligned}
P\{X \ge m\} &\le 2P\{\mathrm{Poi}(m)\text{ stations is not enough}\}\\
&\le 2ne^{-m(2r-1)/n} + P\{\text{cell 1 or cell }n\text{ is not covered}\}\\
&\to 0 \text{ as } n\to\infty \text{ if } m = \frac{(1+\epsilon)n\ln n}{2r-1}.
\end{aligned}$$

As to a bound going the other direction, note that if cells differ by $2r-1$ or more then the events that they are covered are independent. Hence,

$$\begin{aligned}
P\{X \le m\} &\le 2P\{\mathrm{Poi}(m)\text{ stations cover all cells}\}\\
&\le 2P\left\{\mathrm{Poi}(m)\text{ stations cover cells }1+(2r-1)j,\ 1 \le j \le \frac{n-1}{2r-1}\right\}\\
&\le 2\left(1 - e^{-\frac{m(2r-1)}{n}}\right)^{\frac{n-1}{2r-1}}\\
&\le 2\exp\left(-e^{-\frac{m(2r-1)}{n}}\cdot\frac{n-1}{2r-1}\right)\\
&\to 0 \text{ as } n\to\infty \text{ if } m = \frac{(1-\epsilon)n\ln n}{2r-1}.
\end{aligned}$$

Thus, in conclusion, we can take $g_1(r) = g_2(r) = \frac{1}{2r-1}$.

10.4 On uniform integrability 

(a) Use the small set characterization of u.i. First, $\sup_i E[|X_i+Y_i|] \le \sup_i(E[|X_i|] + E[|Y_i|]) \le (\sup_i E[|X_i|]) + (\sup_i E[|Y_i|]) < \infty$. Second, given $\epsilon > 0$, there exists $\delta$ so small that $E[|X_i|I_A] \le \frac{\epsilon}{2}$ and $E[|Y_i|I_A] \le \frac{\epsilon}{2}$ for all $i$, whenever $A$ is an event with $P\{A\} \le \delta$. Therefore $E[|X_i+Y_i|I_A] \le E[|X_i|I_A] + E[|Y_i|I_A] \le \epsilon$ for all $i$, whenever $A$ is an event with $P\{A\} \le \delta$. Thus, $(X_i+Y_i : i \in I)$ is u.i.

(b) Suppose $(X_i : i \in I)$ is u.i., which by definition means that $\lim_{c\to\infty}K(c) = 0$, where $K(c) = \sup_{i\in I}E[|X_i|I_{\{|X_i|>c\}}]$. Let $c_n$ be a sequence of positive numbers monotonically converging to $+\infty$ so that $K(c_n) \le 2^{-n}$. Let $\varphi(u) = \sum_{n=1}^{\infty}(u-c_n)_+$. The sum is well defined, because for any $u$, only finitely many terms are nonzero. The function $\varphi$ is convex and increasing because each term in the sum is. The slope of $\varphi(u)$ is at least $n$ on the interval $(c_n,\infty)$ for all $n$, so that $\lim_{u\to\infty}\frac{\varphi(u)}{u} = +\infty$. Also, for any $i \in I$ and $c$, $E[(|X_i|-c)_+] = E[(|X_i|-c)I_{\{|X_i|>c\}}] \le K(c)$.

Therefore, $E[\varphi(|X_i|)] \le \sum_{n=1}^{\infty}E[(|X_i|-c_n)_+] \le \sum_{n=1}^{\infty}2^{-n} \le 1$. Thus, the function $\varphi$ has the required properties.

The converse is now proved. Suppose $\varphi$ exists with the desired properties, and let $\epsilon > 0$. Select $c_o$ so large that $u \ge c_o$ implies that $u \le \epsilon\varphi(u)$. Then, for all $c \ge c_o$,

$$E[|X_i|I_{\{|X_i|>c\}}] \le \epsilon E[\varphi(|X_i|)I_{\{|X_i|>c\}}] \le \epsilon E[\varphi(|X_i|)] \le \epsilon K.$$

Since this inequality holds for all $i \in I$, and $\epsilon$ is arbitrary, it follows that $(X_i : i \in I)$ is u.i.

10.6 A stopped random walk 

(a) By the fact $S_0 = 0$ and the choice of $\tau$, $|S_n| \le c$ for $0 \le n \le \tau-1$. Since the $W$'s are bounded by $D$, it follows that $|S_\tau| \le c + D$. Therefore, $S_{n\wedge\tau}$ is a bounded sequence of random variables, so $0 = E[S_{n\wedge\tau}] \to E[S_\tau]$ as $n \to \infty$, by the dominated convergence theorem. (Or we could invoke the first or third form of the optional stopping theorem discussed in class. Note here that we are taking it for granted that $\tau < \infty$ with probability one. If we don't want to make that assumption, we could appeal to part (b) or some other method.)

(b) Let $\mathcal{F}_n = \sigma(W_1,\ldots,W_n)$. Then $M_{n+1} - M_n = 2W_{n+1}S_n + W_{n+1}^2 - \sigma^2$, so that $E[M_{n+1}-M_n \mid \mathcal{F}_n] = 2E[W_{n+1}\mid\mathcal{F}_n]S_n + E[W_{n+1}^2 - \sigma^2] = 0$, so that $M$ is a martingale. Therefore, $E[M_{\tau\wedge n}] = 0$ for all $n$, or $E[\tau\wedge n] = E[S_{n\wedge\tau}^2]/\sigma^2$. Arguing as in part (a), we see that $|S_{n\wedge\tau}| \le c + D$ for all $n$. Thus, $E[\tau\wedge n] \le \frac{(c+D)^2}{\sigma^2}$ for all $n$. But $E[\tau\wedge n] \to E[\tau]$ as $n \to \infty$ by the monotone convergence theorem, so that $E[\tau] \le \frac{(c+D)^2}{\sigma^2}$.

(c) We have that $E[|S_{n+1}-S_n| \mid \mathcal{F}_n] = C$ for all $n$ and $E[\tau] < \infty$ by part (b), so by the third version of the optional stopping theorem discussed in class, $E[S_\tau] = E[S_0] = 0$.
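A small Monte Carlo experiment illustrates the conclusions $E[S_\tau] = 0$ and $E[\tau] = E[S_\tau^2]/\sigma^2 \le (c+D)^2/\sigma^2$. The sketch assumes the simplest instance, chosen here for illustration: steps $W_i = \pm 1$ with equal probability (so $D = 1$ and $\sigma^2 = 1$) and $\tau = \min\{n : |S_n| \ge c\}$ with integer $c$, in which case $S_\tau = \pm c$ and $E[\tau] = c^2$.

```python
import random

def stopped_walk(c, rng):
    """Run S_n = W_1 + ... + W_n with W_i = +/-1 until |S_n| >= c; return (tau, S_tau)."""
    s, n = 0, 0
    while abs(s) < c:
        s += rng.choice((-1, 1))
        n += 1
    return n, s

rng = random.Random(0)                     # fixed seed for reproducibility
c, trials = 5, 20000
results = [stopped_walk(c, rng) for _ in range(trials)]
mean_tau = sum(t for t, _ in results) / trials
mean_s_tau = sum(s for _, s in results) / trials
```

With these settings the empirical mean of $\tau$ should be close to $c^2 = 25$, comfortably below the bound $(c+D)^2/\sigma^2 = 36$.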

10.8 On the size of a maximum matching in a random bipartite graph 

(a) Many bounds are possible. Generally the tighter bounds are more complex. If $d = 1$, then every vertex in $V$ of degree one or more can be included in a single matching, and such a matching is a maximum matching with expected size $n(1 - (1-\frac{1}{n})^n) \ge n(1-e^{-1})$. The cardinality of a maximum matching cannot decrease as more edges are added, so $a = n(1-e^{-1})$ is a lower bound for any $d$ and $n$. An upper bound is given by the expected number of vertices of $V$ with degree at least one, so $b = n(1 - (1-\frac{d}{n})^n) \approx n(1-e^{-d})$ works.

(b) The variable $Z$ can be expressed as $Z = F(V_1,\ldots,V_n)$, where $V_i$ is the set of neighbors of $u_i$, and $F$ satisfies the Lipschitz condition with constant 1. We are thus considering a martingale associated with $Z$ by exposing the vertices of $U$ one at a time. The (improved version of the) Azuma-Hoeffding inequality yields that $P\{|Z - E[Z]| \ge \gamma\sqrt{n}\} \le 2e^{-2\gamma^2}$.

(c) Process the vertices of U one at a time. Select an unprocessed vertex in U. If the vertex has 
positive degree, select one of the edges associated with it and add it to the matching. Remove edges 
from the graph that become ineligible for the matching, and reduce the degrees of the unprocessed 
vertices correspondingly. (This algorithm can be improved by selecting the next vertex in U from 
among those with smallest reduced degree, and also selecting outgoing edges with endpoint in V 
having the smallest reduced degree in V, or both. If every matched edge has an endpoint with 
reduced degree one at the time of matching, the resulting matching has maximum cardinality. ) 
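The basic greedy procedure of part (c), without the degree-based refinements, can be sketched as follows. The random graph model used to exercise it (each vertex of $U$ picks $d$ distinct neighbors in $V$ uniformly at random) and the function name `greedy_matching` are assumptions made for illustration.

```python
import random

def greedy_matching(neighbors):
    """neighbors[u] is the set of V-vertices adjacent to u; returns a dict {u: v}."""
    matched_v, matching = set(), {}
    for u, nbrs in enumerate(neighbors):       # process U one vertex at a time
        eligible = [v for v in sorted(nbrs) if v not in matched_v]
        if eligible:                           # u still has positive reduced degree
            v = eligible[0]
            matching[u] = v
            matched_v.add(v)                   # edges into v become ineligible
    return matching

rng = random.Random(1)
n, d = 50, 3                                   # hypothetical sizes
graph = [set(rng.sample(range(n), d)) for _ in range(n)]
m = greedy_matching(graph)
```

The result is a maximal matching (no edge can be added), though not necessarily a maximum one; the refinements mentioned in part (c) aim to close that gap.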
autocorrelation function, see correlation function
autocovariance function, see covariance function 

baseband 

random process, 265 

signal, 264 
Baum-Welch algorithm, 160
Bayes' formula, 6
Bernoulli distribution, 21 
binomial distribution, 22 
Borel sets, 3 
Borel-Cantelli lemma, 7 
bounded convergence theorem, 343 
bounded input bounded output (bibo) stability, 253
Brownian motion, 111 

Cauchy 

criterion for convergence, 52, 334 

criterion for m.s. convergence in correlation form, 54

sequence, 53, 334 
central limit theorem, 58 
characteristic function 

of a random variable, 21 

of a random vector, 75 
Chebychev inequality, 20 
circular symmetry, 273 

joint, 273 
completeness 

of a probability space, 208 

of the real numbers, 334 
conditional 

expectation, 28, 80 

mean, see conditional expectation 



pdf, 27 

probability, 5 
conjugate prior, 147 
continuity 

of a function, 336 

of a function at a point, 336 

of a random process, 210 

of a random process at a point, 209 

piecewise m.s., 213 
convergence of sequences 

almost sure, 41 

deterministic, 332 

in distribution, 47 

in probability, 43 

mean square, 43, 231 
convex function, 59 
convolution, 253 
correlation 

coefficient, 28 

cross correlation matrix, 73 

function, 105 

matrix, 73 
count times, 113 
countably infinite, 332 
counting process, 113 
covariance 

cross covariance matrix, 73 

function, 105, 275 

matrix, 74 

pseudo-covariance function, 275 

pseudo-covariance matrix, 274 
Cramer's theorem, 62 

cumulative distribution function (CDF), 8, 26, 105

derivative

right-hand, 337 
differentiable, 337 

at a point, 337 

continuous and piecewise continuously, 338 

continuously, 338 

m.s. at a point, 213 

m.s. continuous and piecewise continuously, 217

m.s. continuously, 213 

m.s. sense, 213 
Dirichlet density, 147 
discrete-time random process, 105 
dominated convergence theorem, 344 
drift vector, 186, 194 

energy spectral density, 256 
Erlang B formula, 185 
Erlang C formula, 185 
expectation, 17 

of a random vector, 73 
expectation-maximization (EM) algorithm, 148 
exponential distribution, 23 

failure rate function, 25 
Fatou's lemma, 344 
forward-backward algorithm, 156
Fourier transform, 255 

inversion formula, 255 

Parseval's identity, 255 
fundamental theorem of calculus, 340 

gambler's ruin problem, 109 
gamma distribution, 24 
Gaussian 

distribution, 23 

joint pdf, 86 

random vector, 85 
geometric distribution, 22 

implicit function theorem, 339 
impulse response function, 252 
independence

events, 5

pairwise, 5
independent increment process, 110 
infimum, 334 
information update, 93 
inner product, 346, 348 
integration 

Riemann-Stieltjes, 342 

Lebesgue, 341 

Lebesgue-Stieltjes, 342 

m.s. Riemann, 218 

Riemann, 339 
inter count times, 113 

Jacobian matrix, 30 

Jensen's inequality, 59 

joint Gaussian distribution, 85 

jointly Gaussian random variables, 85 

Kalman filter, 91 

law of total probability, 6 
law of large numbers, 57 

strong law, 57 

weak law, 57 
liminf, or limit inferior, 335, 336 
limit points, 335, 336 
limsup, or limit superior, 335, 336 
linear innovations sequence, 91 
Lipschitz condition, 319 
Little's law, 181 

log moment generating function, 60 
log-sum inequality, 151 

Markov inequality, 20 
Markov process, 121 

aperiodic, 170 

birth-death process, 173 

Chapman-Kolmogorov equations, 124 

equilibrium distribution, 124 

generator matrix, 127 

holding times, 130 

irreducible, 169 

jump process, 130 

Kolmogorov forward equations, 128

nonexplosive, 173

null recurrent, 170, 174 

one-step transition probability matrix, 125 

period of a state, 170 

positive recurrent, 170, 174 

pure-jump for a countable state space, 173 

pure-jump for a finite state space, 126 

space-time structure, 130, 131 

stationary, 124 

time homogeneous, 124 

transient, 170, 174 

transition probabilities, 124 

transition probability diagram, 125 

transition rate diagram, 127 
martingale, 110 
matrices, 345 

characteristic polynomial, 347, 349 

determinant, 346 

diagonal, 345 

eigenvalue, 346, 347, 349 

eigenvector, 346 

Hermitian symmetric, 349 

Hermitian transpose of, 347 

identity matrix, 345 

positive semidefinite, 347, 349 

symmetric, 345 

unitary, 348 
maximum, 334 

maximum a posteriori probability (MAP) estimator, 144
maximum likelihood (ML) estimator, 143 
mean, see expectation 
mean ergodic, 225
mean function, 105 
mean square closure, 285 

memoryless property of the geometric distribution, 22
minimum, 334 
monotone convergence theorem, 345 

narrowband 

random process, 270 
signal, 267 



norm 

of a vector, 346 

of an interval partition, 340 
normal distribution, see Gaussian distribution 
Nyquist sampling theorem, 264

orthogonal, 348 

complex random variables, 231 

random variables, 75 

vectors, 346 
orthogonality principle, 76 
orthonormal, 346 

basis, 346, 348 

matrix, 346 

system, 232 

Parseval's relation, 234 

partition, 6 

periodic WSS random processes, 240, 242 

permutation, 346 

piecewise continuous, 337 

Poisson arrivals see time averages (PASTA), 184 

Poisson distribution, 22 

Poisson process, 113 

posterior, or a posteriori, 144 

power 

of a random process, 257 

spectral density, 256 
prior, or a priori, 144 
probability density function (pdf), 26 
projection, 76 

Rayleigh distribution, 24, 26 
Riemann sum, 339 

sample path, 105 

Schwarz's inequality, 28 

second order random process, 106 

sinc function, 258

span, 346 

spectral representation, 241, 244 

stationary, 116, 232 

wide sense, 116, 275 
strictly increasing, 333
subsequence, 335
supremum, 334 

Taylor's theorem, 338 

time update, 93 

time-invariant linear system, 253 

tower property, 82 

transfer function, 256 

triangle inequality, $L^2$, 28

uniform distribution, 24 
uniform prior, 145 

version 

of a random process, 240, 243 
Viterbi algorithm, 158 

wide sense stationary, 275

wide sense stationary, 116 

Wiener process, see Brownian motion